CHAPTER ONE
OVERVIEW  OF  REPORT 


1.0  INTRODUCTION 


By  exploiting  the  parallel  nature  of  matrix  computations,  a 
system  configuration  has  been  proposed  for  performing  the  basic 
matrix  computations  of  matrix  addition,  subtraction,  scalar 
multiplication,  matrix  multiplication,  and  matrix  inversion. 
Algorithms  have  been  developed  for  partitioning  the  matrix 
structure  on  the  basis  of  this  inherent  parallelism.  A  single 
instruction multiple data stream machine, the Martin-Jones-Hughes
Version One (the MJH1), has been developed for the efficient
allocation  of  computational  tasks  in  a  parallel  processing 
environment  as  prescribed  by  specific  matrix  operation 
partitioning  algorithms. 


The  concept  of  processing  data  in  parallel  arose  from  the 
need  to  decrease  computation  time  without  decreasing  the  amount 
of computed data [1]. Novel architectures have been developed,
such as the Staran and the Cray-1, in order to capitalize on the
concept. These architectures facilitate the strategy of
distributing data evenly to a predetermined number of processing
elements  (PE’s)  operating  concurrently  (in  parallel)  as  opposed 
to  sequentially  (one  after  the  other)  and  then  routing  all  of  the 
intermediate results to a specified location for further PE
distribution, if required. This method of computing data is both
efficient  and  effective. 

This  idea  permeates  the  concept  of  the  parallel  machine 
realized  in  this  report.  The  MJH1  is  a  multiprocessor  that  is 
capable  of  partitioning  matrices  for  specified  matrix  operations. 
After  successfully  partitioning  and  allocating  the  matrices,  the 
computation  can  be  performed  in  a  minimum  amount  of  time. 

Traditionally,  the  von  Neumann  computer  has  been  the  model 
for  all  computers.  However,  due  to  rapid  innovations  in 
technology,  processing  speed  has  become  a  factor.  Since  the  von 
Neumann  machine  is  a  sequential  control  flow  machine, 
computations  must  be  done  in  series,  even  though  they  may  be  done 
quicker  by  dividing  the  overall  job  among  various  sections  of  a 
machine’s  memory.  Many  novel  parallel  architectures  have  been 
invented because of the sluggishness of the von Neumann-like
machine.

The  concept  of  parallelism,  or  concurrency,  yields 
efficiency.  In  circuit  theory,  circuits  can  be  solved  by 
defining  currents  via  mesh  or  loop  equations.  The  unknown 
currents  may  be  solved  by  using  Cramer’s  Rule,  an  implementation 
of  the  matrix  structure.  The  loop  method  is  a  very  general 
method  for  defining  specific  unknowns.  It  follows  that  loop 
equations  can  be  implemented  to  describe  parallelism.  Loop 
equations  can  be  used  to  solve  matrices.  Hence,  matrices  exhibit 
some inherent parallelism. Because this is true, matrix
operations are easily adaptable to parallel architectures in
order to perform matrix computations. Because of the inherent
parallelism of the solution techniques, matrices can be split up
or partitioned among several processors, operations can be
performed, and then their results written back to some central
location.

1.1  REPORT  OBJECTIVE 

The  objective  of  this  research  was  to  develop  an  emulator 
tool  for  verifying  the  matrix  partitioning  algorithms  developed 
by  Sadhasivan  [2].  The  emulator  that  performs  these  operations 
is the MJH1 [3].

1.2  REPORT  ORGANIZATION 

An  analysis  of  the  parallelism  exhibited  in  the  matrix 
multiplication  operation  is  presented  in  this  report.  Algorithms 
developed  for  efficient  partitioning  of  the  matrix  structure  and 
the  allocation  of  parallel  computational  tasks  in  a 
multiprocessing  environment  will  be  simulated.  Chapter  Two 
discusses  various  system  architectures  and  their  relative 
interconnection  networks.  Chapter  Three  defines  the  partitioning 
algorithms  used  to  perform  the  various  matrix  computations. 
Chapter  Four  expounds  on  the  specifics  of  the  multiple  processor 
architecture,  including  the  control  processing  unit,  the 
arithmetic  processors  (PE’s)  and  the  I/O  capabilities.  Also,  the 
emulator  itself  will  be  discussed.  Finally,  Chapter  Five 
includes  a  summation  of  the  work  done  to  date,  as  well  as 
suggestions for future work.


CHAPTER  TWO 


SYSTEM  ARCHITECTURE 


2.0  INTRODUCTION 


The  idea  of  connecting  N  computers  together  to  form  one  big 
super  computer  is  the  basic  idea  behind  parallel  processing  [4]. 
If  realized,  the  throughput  potential  of  N  computers  can  be  N 
times  that  of  one  single  computer.  Thus,  processing  data  in 
parallel yields great "computing power," which can approach a
linear gain in overall computation speed. Thus, the idea of
dividing and conquering yields faster throughput [5,6]. A basic
hardware model of concurrent or parallel processing systems
follows.


[Figure: processors (P) connected through an interconnection network]

FIGURE 2.1
HARDWARE MODEL OF CONCURRENT PARALLEL SYSTEMS


Parallel computer architectures can be grouped into four
distinct classes [7,8]. These four classes are distinguished by
the parallelism afforded by the instructions coupled with the
parallelism afforded by the data. Instruction streams can be
either single or multiple, and data streams can be distinguished
in the same manner.

The  classic  von  Neumann  machine  is  classified  as  a  Single 
Instruction Stream Single Data Stream (SISD) machine. A block
diagram of the SISD architecture is pictured in Figure 2.2. This
sequential  machine  can  only  execute  one  instruction  on  one  piece 
of  data  at  one  time.  Figure  2.3  shows  the  Array  Processor  or  the 
Single Instruction Stream Multiple Data Stream (SIMD) machine,
which executes one instruction that acts on many pieces of data
simultaneously. Usually, these data are located in arithmetic
processors (APs) or in processing elements (PE’s). These
processors  that  contain  the  data  are  controlled  by  some  type  of 
control  processor.  This  architecture  class  is  easily  adaptable 
to  performing  computations  that  can  be  broken  into  a  series  of 
vector operations, e.g., matrix operations. Perhaps the rarest
configuration in practice is the Multiple Instruction Stream
Single  Data  Stream  (MISD)  machine  as  shown  in  Figure  2.4.  This 
machine  exhibits  parallelism  in  the  instruction  stream  only  and 
allows  several  instructions  to  be  executed  on  the  same  datum. 
The  fourth  and  final  class  of  parallel  computer  architectures  is 
the  Multiple  Instruction  Stream  Multiple  Data  Stream  (MIMD) 
machine.  This  configuration  allows  for  parallelism  in  both 
streams, instruction and data. A block diagram of the MIMD
architecture is shown in Figure 2.5.

[Figure 2.4: MISD block diagram]

Of the four classes of architectures, two can be used to
realize  parallelism  with  respect  to  data  manipulations.  These 
two  are  SIMD  and  MIMD  machines. 
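
The four-way classification above amounts to a lookup on two counts: the number of instruction streams and the number of data streams. The short Python sketch below illustrates this. The function name `flynn_class` and its integer arguments are assumptions introduced here for illustration; they are not part of the MJH1 emulator.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the class name (SISD, SIMD, MISD, or MIMD) for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be positive")
    # One stream is "Single" (S); more than one is "Multiple" (M).
    instruction_part = "S" if instruction_streams == 1 else "M"
    data_part = "S" if data_streams == 1 else "M"
    return f"{instruction_part}I{data_part}D"


if __name__ == "__main__":
    # A control processor broadcasting one instruction to 8 PEs.
    print(flynn_class(1, 8))
```

Under this view, a machine with one control processor and several arithmetic processors falls in the SIMD class.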


In  the  class  of  SIMD  machines,  we  distinguish  between  vector 
and  array  computers.  This  distinction  is  based  primarily  on  the 
way  data  are  communicated  to  elements  of  the  system.  Processors 
in  array  computers  typically  access  data  from  their  own  memories 
and  those  of  their  nearest  neighbors  (through  some  type  of  shift 
network).  Vector  computation  has  been  supported  through 
pipelining  or  streaming  and  by  synchronous  multiprocessing. 


Figure 2.6 summarizes this classification of parallel computers.


2.1 SIMD ARCHITECTURE


The  Array  Processor,  Single  Execution  Array,  or  SIMD  array 
is  a  machine  that  executes  one  instruction  per  array  of  data. 
Distinguishing  architectural  features  include:  a  control  unit 
that  houses  a  local  memory  and  can  broadcast  one  single 
instruction  to  all  processing  elements  (arithmetic  processors);  a 
predefined  number  of  processing  elements;  processor  memories  (one 
per processing element); and a communication system among processors
and  external  data  sources. 

The  main  memory  contains  the  combined  memory  of  the  N 
processor  elements  (PE’s)  and  all  of  the  instructions  are  stored 
there.  These  are  connected  via  a  high  speed  data  bus  with  a 
bandwidth  N  times  that  of  the  individual  memories  and  compatible 
with  both  the  processor  and  the  I/O  bandwidths. 

The number of available arithmetic processors and the number
of computations required per processor for the complete execution
of the computation determine the instruction stream broadcast to
the arithmetic processors by the control processor. In other
words, the system configuration determines the stream of
information broadcast to the enabled processors. The control
processor must know prior to runtime how the system is configured.

The  instructions  are  fetched  from  the  memory  in  blocks  into 
a  buffer.  These  instructions  can  be  of  two  types:  control 
instructions  executed  by  the  control  unit,  or  vector  instructions 
executed  by  the  arithmetic  processor.  A  listing  of  the 
instructions for the MJH1 is included in a subsequent chapter.

Having instructions of two different kinds leads to processor
idling, which results in the inefficient use of processing time.
Processor idling can result when the controller executes a
scalar  instruction,  and  the  arithmetic  processors  are  idle 
waiting  for  the  next  operation.  Alternatively,  when  computations 
require  more  than  one  cycle  of  parallel  processor  operation,  all 
the  processors  may  not  be  needed  for  the  last  cycle  of  operation. 
This  results  in  idle  processors.  Although  processor  idling  can 
occur,  program  execution  and  data  manipulation  in  parallel  is  by 
far  faster  than  the  sequential  execution  and  manipulation  of 
data. 
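
The second source of idling, processors left unused in the last cycle, can be quantified with a small calculation. The Python sketch below is an illustration only. It assumes that every computation occupies one processor for exactly one cycle, and the name `parallel_cycles` is hypothetical.

```python
def parallel_cycles(tasks: int, processors: int) -> tuple[int, int]:
    """Return (cycles needed, processors idle during the last cycle).

    Assumes each task occupies one processor for one cycle.
    """
    if tasks < 0 or processors < 1:
        raise ValueError("tasks must be >= 0 and processors must be >= 1")
    cycles = -(-tasks // processors)          # ceiling division
    idle_in_last_cycle = cycles * processors - tasks
    return cycles, idle_in_last_cycle


if __name__ == "__main__":
    # 10 element computations on 4 processors: 3 cycles, 2 processors idle at the end.
    print(parallel_cycles(10, 4))
```

When the number of tasks is an exact multiple of the number of processors, no processor idles in the last cycle.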


This problem can be alleviated, however, by buffering the
instruction stream, that is, by fetching several instructions at
once and then distributing them evenly and efficiently; by
pipelining the instruction stream via the control processor; and
by overlapping the instruction fetch sequence with data
manipulations not requiring main memory usage.

Since  there  is  only  one  control  processor,  the  processing 
elements must be synchronized by that single processor. There is
only one clock per machine; a single timing signal controls
everything. This makes conditional branching to route data past
certain  idle  processors  virtually  impossible. 

PE routing is achieved, instead, via the control processor
testing, setting, or resetting the mask bit associated with
each PE. A PE receives the broadcast instruction or data only
if it is enabled, i.e., if its mask bit is high. Mask bits are
set at the completion of an instruction fetch cycle, prior to
the execution of the next instruction. Mask bits can only be
set or reset by the control processor. However, this mask
checking slightly degrades the parallelism of the SIMD machine.
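
The masking scheme can be sketched in a few lines of Python. This is an illustration only: the function name `broadcast`, the list-based model of the values held by each PE, and the Boolean mask are assumptions made here and do not reproduce the MJH1 emulator.

```python
def broadcast(operation, local_values, mask_bits):
    """Apply one broadcast operation on every enabled PE.

    local_values[i] is the value held by PE i; mask_bits[i] is True
    when PE i is enabled. Disabled PEs keep their value unchanged.
    """
    if len(local_values) != len(mask_bits):
        raise ValueError("one mask bit is required per PE")
    return [
        operation(value) if enabled else value
        for value, enabled in zip(local_values, mask_bits)
    ]


if __name__ == "__main__":
    # The controller enables PEs 0 and 2 only, then broadcasts "double".
    print(broadcast(lambda x: 2 * x, [1, 2, 3, 4], [True, False, True, False]))
```

Only the controller-side mask decides which PEs take part, which matches the statement that mask bits are set or reset by the control processor alone.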

2.2 INTERCONNECTION NETWORKS

Access and interprocessor communications are major problems
in  the  programming  and  design  of  SIMD  machines.  A  number  of 
different  networks  have  been  proposed  to  achieve  fast,  efficient 
communications  at  a  reasonable  cost.  Some  of  these 
interconnection  networks  are  included  in  subsequent  sections  [9]. 

2.2.1 CYCLIC-SHIFT NETWORK

Also  called  the  exchange  network  or  the  uniform  shift 
alignment  network,  the  cyclic-shift  interconnection  network 
involves  several  processors  connected  cyclically  to  facilitate 
bidirectional  data  transfer  between  adjacent  processors  in  a  unit 
shift time of one cycle. A cyclic shift of arbitrary distance is
achieved by a sequence of unit cyclic shifts. In a system of p
processors, there are ((i-j-1) mod p) processors between
processors i and j through which the data must be shifted while
moving from processor i to processor j.
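
The count of intermediate processors can be checked with a short Python sketch. It is illustrative only: the function name is hypothetical, and the code assumes that the formula counts processors along a single shift direction (from i toward lower-numbered processors, wrapping around).

```python
def intermediate_processors(i: int, j: int, p: int) -> int:
    """Processors lying between i and j on a ring of p processors.

    Implements ((i - j - 1) mod p) as stated in the text. Python's %
    operator already returns a non-negative result for positive p.
    """
    if p < 1:
        raise ValueError("p must be positive")
    return (i - j - 1) % p


if __name__ == "__main__":
    # On a ring of 8 processors, data moving from 5 to 2 passes through 4 and 3.
    print(intermediate_processors(5, 2, 8))
    # In the same direction, data moving from 2 to 5 passes through 1, 0, 7 and 6.
    print(intermediate_processors(2, 5, 8))
```

Because the network is bidirectional, a controller could choose whichever direction gives the smaller count, but that choice is not stated in the text.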

The  main  advantage  of  the  cyclic  interconnection  is  the 
constant  number  of  connections  it  supports  per  processor  in  an  N 
processor  system.  Every  processor  is  connected  to  its  adjacent 
neighbor. However, the network is very inefficient for
algorithms requiring noncyclic data permutations and may
sometimes require N mask-and-shifts involving N alignment cycles
to  route  data  to  the  appropriate  processor  for  further 
manipulation.  An  illustration  of  this  interconnection  follows  in 
Figure  2.7. 


2.2.2  CROSSBAR  NETWORK 


This  interconnection  network  in  its  simplest  form  can  be 
positioned  between  the  source  units  and  destination  units  to 
implement  the  data  movement  in  one  transfer  instruction.  This 
network,  known  for  its  simplicity,  is  an  N  by  N  array  of 
switches,  with  the  N  source  units  connected  to  the  rows  and  the  N 
destination  units  connected  to  the  columns.  The  network  is 
bidirectional. Each processor and memory has an input and output
connection to the network, which provides simultaneous
conflict-free connections from any source to any destination for
a one-to-one mapping.

A major disadvantage of the crossbar network is that its
size grows as N squared as PE's are added. This quadratic growth
is not compatible with the typical linear increase found in
parallel processing. Thus, the crossbar network is best suited
for systems configured with a small number of source and
destination units. A block diagram of the
crossbar  network  follows.  This  configuration  consists  of  four 
processing  elements. 
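
The quadratic growth can be made concrete with a small Python sketch. It assumes one switch per row/column crossing of the N by N array, as the description suggests, and the function names are hypothetical.

```python
def crossbar_switches(n: int) -> int:
    """Number of switch points in an n-by-n crossbar (one per crossing)."""
    if n < 1:
        raise ValueError("n must be positive")
    return n * n


def growth_table(sizes):
    """Pair each system size with its crossbar switch count."""
    return [(n, crossbar_switches(n)) for n in sizes]


if __name__ == "__main__":
    # Doubling the number of PE's quadruples the number of switches.
    for n, switches in growth_table([4, 8, 16]):
        print(n, switches)
```

For the four-PE configuration mentioned above, this gives 16 switch points.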


FIGURE  2.8 
CROSSBAR  NETWORK 




2.2.3  PERFECT  SHUFFLE  NETWORK 

This  network  derives  its  name  from  the  shuffling  of  a  deck 
of  cards.  The  basic  idea  involves  splitting  N  processors  into 
equal  halves  and  interlacing  them.  The  ith  processor  of  the 
unshuffled  deck  is  bidirectionally  connected  to  the  ith  processor 
of  the  shuffled  deck.  The  perfect  shuffle  connection  pattern 
routes  data  from  position  P  to  position  P*. 
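
The text does not spell out how P* is obtained from P. The Python sketch below uses the usual definition of the perfect shuffle for an even number N of positions numbered from 0: positions in the first half move to the even destinations and positions in the second half move to the odd destinations. This definition is an assumption supplied here, not a formula taken from this report, and the function name is hypothetical.

```python
def perfect_shuffle(position: int, n: int) -> int:
    """Destination of a 0-based position under a perfect shuffle of n items."""
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number of at least 2")
    if not 0 <= position < n:
        raise ValueError("position out of range")
    half = n // 2
    if position < half:
        return 2 * position               # first half -> even destinations
    return 2 * (position - half) + 1      # second half -> odd destinations


if __name__ == "__main__":
    n = 8
    shuffled = [None] * n
    for source in range(n):
        shuffled[perfect_shuffle(source, n)] = source
    # The two halves 0..3 and 4..7 come out interlaced.
    print(shuffled)
```

The first and last positions stay fixed, as they do when the two halves of a card deck are interlaced.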

The  above  mentioned  interconnection  networks  are  typical  in 
many  SIMD  machines.  However,  processors  may  be  dedicated  to 
highly  specialized  algorithms,  and  their  interconnections  may  be 
designed  specifically  for  the  implementation  of  these  algorithms, 
and  may  be  different  from  the  ones  mentioned  in  this  thesis. 
SIMD machines tend to exhibit higher computation speed. This
is primarily due to the fact that SIMD machines do not require as
much synchronization, task scheduling, and system software as
does the MIMD machine. Also, high reliability of the SIMD
configurations can be attributed to the redundancy of the data
manipulations.


FIGURE  2.9 


PERFECT  SHUFFLE  NETWORK 


2.3 MIMD MACHINES


As stated before, MIMD machines are computers with two or
more general purpose processors, each capable of executing a
separate stream of instructions on a stream of data that is housed
in  central  memory.  In  essence,  the  advantage  of  the  MIMD  machine 
is  its  capability  to  share  memory,  I/O  units,  and  several 
computing  units.  Because  MIMD  architecture  is  centered  around 
the  concept  of  shared  resources,  MIMD  machines  are  broadly 
classified  into  two  categories:  tightly  coupled  systems,  and 
loosely  coupled  systems. 

Tightly  coupled  systems  consist  of  several  centrally  located 
processors  that  share  the  same  memory  and  data  structures  and  are 
supervised  by  a  single  operating  system.  Loosely  coupled  systems 
are  comprised  of  independent  computer  systems  that  are  physically 
distributed  at  several  locations  and  supervised  by  a  distributed 
operating  system.  Each  processor  has  its  own  system  software  and 
data  structures,  and  shares  a  common  data  base  with  the  other 
systems  via  slow  speed  communication  lines. 

There  are  two  methods  of  managing  multiprocessor  systems. 
One  approach  involves  a  hierarchical  organization  with  functional 
division amongst the processors and a supervisor controlling the
specialized  functional  units.  An  alternate  approach  involves  a 
network  of  independent  computing  systems  communicating  with  each 
other  on  an  equal  basis.  In  both  methods,  the  individual 
processors  execute  the  individual  instruction  streams  assigned  to 
them sequentially.

Since the MIMD system is more flexible than the SIMD
machine,  it  is  more  suitable  for  a  large  class  of  computations. 
However,  flexibility  comes  with  a  price.  That  price  is 
synchronization  and  allocation  problems.  The  partitioning  of  the 
physical  problem  into  several  processes  that  can  be  executed  in 
parallel  is  of  major  concern  in  the  MIMD  machine.  However,  MIMD 
systems  are  configured  for  ease,  overall  reliability,  with 
functional  specialization  for  overall  throughput.  Despite  this, 
the  problems  of  resource  allocation,  synchronization,  and  problem 
partitioning  will  not  allow  the  realization  of  MIMD  computers 
with  several  processors.  Coupling  refers  to  the  ability  of  the 
various  processors  to  share  resources.  Figure  2.10  shows  a 
breakdown of the MIMD class of parallel computers.


MIMD
    TIGHTLY COUPLED: MULTIPROCESSOR SYSTEMS
    LOOSELY COUPLED: COMPUTER NETWORKS


FIGURE 2.10

MIMD MULTIPLE PROCESSOR ORGANIZATION


2.4 SYSTEM ARCHITECTURE FOR MATRIX COMPUTATIONS


The  system  realized  via  this  thesis  is  designed  specifically 
for  the  solution  of  various  matrix  computations.  The  SIMD 
configuration  was  chosen  to  satisfy  the  special  purpose 
requirements for the matrix operations of solution of
simultaneous equations and matrix transpose. All these
operations  can  be  done  in  parallel. 

The  system  architecture  configures  a  system  of  K  arithmetic 
processors, each equipped with local memories interconnected in a
bidirectional cyclic-shift fashion. Each processor is
individually connected to the control processor, and to central
memory.  Communication  to  the  controller  is  achieved  via 
handshaking  signals.  To  avoid  bus  contention,  the  data  required 
for  the  computations  are  assigned  to  each  processor’s  local 
memory  during  the  initial  transfer  of  data,  prior  to  runtime. 
The  controller  also  broadcasts  the  intended  operation  to  be 
performed along with the data values. The operand addresses are
not broadcast, since the processors execute the specified
operations on the data that are stored from a specific starting
address in their local memories. To further avoid bus
contention, the controller monitors the state of each arithmetic
processor and keeps track of the number of computations to be
performed by each processor. The controller can test and set
mask bits in order to shut out idle or unnecessary processors.

Interconnection between arithmetic processors involving data
transfer  is  as  follows.  Data  to  be  transferred  is  put  onto  the 
shift  port.  Then,  through  unit  cyclic  shifts,  the  data  is  passed 
from  processor  to  processor  until  an  enabled  (or  strobed) 
processor  receives  the  data  from  the  port  and  puts  it  into  its 
shift  memory.  The  controller  strobes  the  appropriate  destination 
processor.  Note  that  the  controller  only  has  access  to  the 
computed  results  at  the  end  of  all  task  executions. 
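
The transfer procedure just described can be sketched as a walk around the ring of processors. The Python code below is an illustration only: the function name `shift_transfer`, the 0-based processor numbering, and the `direction` parameter are assumptions introduced here, and the sketch does not model the shift port or shift memory hardware.

```python
def shift_transfer(source: int, strobed: int, k: int, direction: int = 1) -> list[int]:
    """Return the processors visited by a datum, ending at the strobed one.

    The datum starts at `source` and moves one processor per unit cyclic
    shift around a ring of k processors until it reaches `strobed`.
    """
    if k < 1 or direction not in (1, -1):
        raise ValueError("k must be positive and direction must be +1 or -1")
    if not (0 <= source < k and 0 <= strobed < k):
        raise ValueError("processor index out of range")
    visited = []
    current = source
    while current != strobed:
        current = (current + direction) % k   # one unit cyclic shift
        visited.append(current)
    return visited


if __name__ == "__main__":
    # Three unit shifts carry the datum from processor 1 to processor 4.
    print(shift_transfer(1, 4, 6))
    # Shifting in the other direction also takes three shifts on this ring.
    print(shift_transfer(1, 4, 6, direction=-1))
```

The length of the returned list is the number of unit shifts the transfer costs.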

The  allocation  algorithms  reside  in  the  main  memory  of  the 
controller.  The  algorithm  is  system  based,  so  once  the  number  of 
arithmetic  processors  in  the  configuration  has  been  defined,  the 
data  allocation  and  data  transfer  operations  are  expressed  as 
only  a  function  of  the  input  matrix  dimensions. 

The  system  architecture  discussed  above  is  implemented  in 
software  by  using  Pascal.  The  data  allocation  and  control 
algorithms  and  the  specific  details  of  the  system  hardware  and 
related  software,  implementing  the  synchronization,  execution  and 
coordination  of  the  various  matrix  computations  in  parallel,  are 
explained in detail in the chapter that expounds on the MJH1.

In  conclusion,  SIMD  and  MIMD  systems  are  the  most  prevalent 
architectures  among  parallel  machines  today.  It  has  been  noted 
that  SIMD  systems  are  well  suited  for  matrix  computations  due  to 
the  inherent  parallelism  found  in  matrices.  This  is  true  because 
there  is  not  a  synchronization  problem  like  the  one  found  in  MIMD 
architectures, since all PE's are controlled by the single
control processor. However, in SIMD systems, the system software
does not manage the parallel architecture but exposes it
directly to the user, allowing maximum benefit without any
system software overhead. This in turn passes the burden of
programming required to exploit the inherent parallelism to the
user.


CHAPTER THREE

PARTITIONING  ALGORITHMS  FOR  MATRIX  COMPUTATIONS 

3.0  INTRODUCTION 

In  this  chapter,  the  actual  algorithms  for  partitioning  the 
matrix  structure  for  basic  matrix  computations  are  presented. 
More specifically, the partitioning algorithms for matrix
addition/subtraction and scalar multiplication (which share
identical algorithms), matrix multiplication, and matrix
inversion are stated as allocation theorems. These theorems were
developed and proven by Sadhasivan. For a more in-depth study of
the  actual  algorithms,  including  theorems,  corollaries,  and  their 
proofs,  the  author  refers  the  reader  of  this  document  to  the 
thesis  written  by  Sadhasivan  [2]. 

The  allocation  theorems  for  each  basic  matrix  computation 
will  merely  be  stated  in  this  chapter  for  clarification  purposes, 
as  well  as  for  completeness.  They  are  the  basis  for  the  proposal 
of  performing  matrix  computations  on  an  emulated 
microprocessor-based SIMD multiprocessor.

3.1 PARTITIONING ALGORITHMS FOR MATRIX ADDITION/SUBTRACTION AND
SCALAR  MULTIPLICATION 

Let the matrices involved in these computations be

    [A] (m x n),    [B] (m x n),    [C] (m x n).

Note that m is the number of rows and n is the number of
columns of [A], [B], and [C]. Assuming the number of arithmetic
processors to be k, the partitioning algorithms can be defined
for the different occurrences of parallelism in the matrix
structure and its operations as shown below.

1.a. If m = k, then the allocation of the [A] and [B] matrix
values is done according to the following theorem.

Theorem: One row of [A] and the corresponding row of [B] are
allocated to each processor.

1.b. If m is a multiple of k, then the allocation of matrix
values is done according to the following theorem.

Theorem: (m/k) rows of [A], in order, and the corresponding rows
of [B] are allocated to each processor.

2.a. If n = k, then the allocation of matrix values is done
according to the following theorem.

Theorem: One column of [A] and the corresponding column of [B]
are allocated to each processor.

2.b. If n is a multiple of k, then the allocation of matrix
values is done according to the following theorem.

Theorem: (n/k) columns of [A], in order, and the corresponding
columns of [B] are allocated to each processor.

3. If t = (mn) and if t > 0 then,
a. if t = k or t is a submultiple of k and,
i. if m <= n, then the allocation of the input matrix values
is done according to the following theorem.

Theorem: For i = 1 to m
         { for j = 1 to n
           { (ni+j-n)th processor is allocated the values A[i,j]
             and B[i,j], to compute the result matrix value
             C[i,j].
           } /* for j */
         } /* for i */

ii. If m > n, then the allocation is done according to the
following theorem.

Theorem: For j = 1 to n
         { for i = 1 to m
           { (mj+i-m)th processor is allocated the values of
             A[i,j] and B[i,j] to compute the result matrix
             value C[i,j].
           } /* for i */
         } /* for j */

3.b. If t is a multiple of k, such that t = jk and,

i. if m <= n, i.e. m is a submultiple of k such that k = qm, then
the allocation of [A] and [B] values is done column wise
according to the following algorithm.

Theorem: For i = 1 to q
         { for r = 1 to m
           { for c = 0 to (j-1)
             { ((i-1)m+r)th processor is allocated the
               A[r,(cq+i)] and B[r,(cq+i)] values to
               compute the result matrix value C[r,(cq+i)]
             } /* for c */
           } /* for r */
         } /* for i */

ii. If m > n, i.e. n is a submultiple of k such that k = qn
(m = jq), then the allocation of [A] and [B] values is
done rowwise according to the following theorem.

Theorem: For i = 1 to q
         { for c = 1 to n
           { for r = 0 to (j-1)
             { ((i-1)n+c)th processor is allocated the
               A[(rq+i),c] and B[(rq+i),c] values to compute
               the result matrix value C[(rq+i),c]
             } /* for r */
           } /* for c */
         } /* for i */

4. If t is not related to k, then letting j = ⌊t/k⌋ and,
a. if m is a submultiple of k such that k = mq and letting i = jq
(also note that if t > k then n > q), the allocation of [A] and
[B] values is done according to the following theorem.

Theorem: For l = 1 to (n-i)
         { for r = 1 to m
           { ((l-1)m+r)th processor is allocated the A[r,l] and
             B[r,l] values and,
             if t > k then
             for c = 1 to j
             { A[r,(cq+l)] and B[r,(cq+l)] values are also
               allocated to the same processor
             } /* for c */
             to compute the corresponding values of the result
             matrix [C]
           } /* for r */
         } /* for l */
         if t > k then
         for l = (n-i+1) to q
         { for r = 1 to m
           { for c = 0 to (j-1)
             { ((l-1)m+r)th processor is allocated the
               A[r,(cq+l)] and B[r,(cq+l)] values to
               compute the result matrix value C[r,(cq+l)]
             } /* for c */
           } /* for r */
         } /* for l */



4.b. If n is a submultiple of k such that k = qn, and letting i = jq
(if t > k, m > q), the allocation of [A] and [B] matrix values
is done according to the following theorem.

Theorem: For l = 1 to (m-i)
         { for c = 1 to n
           { ((l-1)n+c)th processor is allocated the A[l,c] and
             B[l,c] values and,
             if t > k then,
             for r = 1 to j
             { A[(rq+l),c] and B[(rq+l),c] values are also
               allocated to the same processor
             } /* for r */
             to compute the corresponding values of the result
             matrix [C]
           } /* for c */
         } /* for l */
         if t > k then
         for l = (m-i+1) to q
         { for c = 1 to n
           { for r = 0 to (j-1)
             { ((l-1)n+c)th processor is allocated
               A[(rq+l),c] and B[(rq+l),c]
             } /* for r */
           } /* for c */
         } /* for l */


5. If the total number of elements of [C] to be computed,
t = mn, is a totally random value, then letting q = ⌊t/k⌋
and r = qk, the partitioning and allocation of the input
matrix values can be done according to the following
theorem.

The positions of [A] and [B] matrix values are first
individually modified to represent them as linear lists such
that the value A[i,j] or B[i,j] occupies the (ni+j-n)th
position in the linear lists.

For p = 1 to (t-r)
{ processor p is allocated the pth values of [A] and [B]
  from their linear lists and,
  if t > k then
  for i = 1 to q
  { (ik+p)th values in the linear lists of [A] and [B] are
    also allocated to it
  } /* for i */
  to compute the corresponding values of the result matrix
  [C]
} /* for p */
if t > k then,
for p = (t-r+1) to k
{ for i = 0 to (q-1)
  { pth processor is allocated the (ik+p)th values of
    [A] and [B] from their linear lists to compute the
    corresponding values of the [C] linear list.
  } /* for i */
} /* for p */
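
The linear-list allocation for an arbitrary t can be traced with a short Python sketch. It is an illustration under stated assumptions and not the MJH1 emulator code: the function names are hypothetical, q is taken to be the integer part of t/k, and the "if t > k" guards of the theorem are omitted because q = 0 already makes the guarded loops empty when t < k. With the guards omitted the sketch also covers t = k, which the text handles under case 3.a.

```python
def linear_position(i: int, j: int, n: int) -> int:
    """1-based position of element (i, j) of an m x n matrix in its linear list."""
    return n * i + j - n


def allocate_linear(m: int, n: int, k: int) -> dict[int, list[int]]:
    """Map each processor (1..k) to the linear-list positions it computes."""
    if m < 1 or n < 1 or k < 1:
        raise ValueError("m, n and k must be positive")
    t = m * n
    q = t // k            # integer part of t/k
    r = q * k
    allocation = {p: [] for p in range(1, k + 1)}
    # The first (t - r) processors receive one value more than the rest.
    for p in range(1, t - r + 1):
        allocation[p].append(p)
        for i in range(1, q + 1):
            allocation[p].append(i * k + p)
    # The remaining processors receive q values each.
    for p in range(t - r + 1, k + 1):
        for i in range(q):
            allocation[p].append(i * k + p)
    return allocation


if __name__ == "__main__":
    # A 3 x 3 addition on 4 processors: processor 1 computes three elements.
    for processor, positions in allocate_linear(3, 3, 4).items():
        print(processor, positions)
```

Every position from 1 to t appears exactly once across the processors, so no element of [C] is computed twice or left out.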

The  partitioning  algorithms  for  scalar  multiplication  of  an 
input  matrix  are  the  same  as  those  suggested  for  matrix 
addition/subtraction  except  that  now  instead  of  two  input 
matrices,  only  one  input  matrix  and  the  scalar  value  are  involved 
in  the  allocation.  The  computation  involved  in  this  case  is 
obviously  the  multiplication  operation  instead  of  addition  or 
subtraction  and  the  computation  time  of  the  result  is 
consequently  expressed  in  terms  of  multiplication  time  for  the 
evaluation of an element of the result matrix instead of
addition/subtraction  time  required  to  compute  the  same  element. 
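
As an illustration of reusing an addition/subtraction allocation for scalar multiplication, the Python sketch below applies the row allocation of case 1.b (m a multiple of k). The function names and the 0-based row indices are assumptions made here, and the loop over row blocks runs sequentially, with each block standing in for the work of one arithmetic processor.

```python
def row_blocks(m: int, k: int) -> list[list[int]]:
    """Case 1.b: assign (m/k) consecutive rows (0-based) to each of k processors."""
    if k < 1 or m % k != 0:
        raise ValueError("m must be a multiple of k")
    rows_each = m // k
    return [list(range(p * rows_each, (p + 1) * rows_each)) for p in range(k)]


def scalar_multiply(a: list[list[float]], s: float, k: int) -> list[list[float]]:
    """Multiply matrix a by scalar s, one row block per emulated processor."""
    result = [None] * len(a)
    for rows in row_blocks(len(a), k):    # each block models one processor
        for row in rows:
            result[row] = [s * value for value in a[row]]
    return result


if __name__ == "__main__":
    matrix = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(row_blocks(4, 2))
    print(scalar_multiply(matrix, 3, 2))
```

Only one input matrix is distributed; the scalar is the same for every processor, so it can be broadcast once.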


3.2  PARTITIONING  ALGORITHMS  FOR  MATRIX  MULTIPLICATION 


For the discussion that follows, let the matrices involved
in the computation be defined as shown below.

    [A] (m x n),    [B] (n x p),    [C] (m x p)

Note that m is the number of rows of [A] and [C], n is the
number of columns of [A] and rows of [B], and p is the number of
columns of [B] and [C]. If intercommunication between arithmetic
processors is to be avoided during the matrix multiplication
operation, then all the values of a specific row of [A] and all
the values of an appropriate column of [B] required to compute a
product element of the result matrix [C] should be allocated to
a single processor to compute that product. The evaluation time
of each element of the product matrix [C] is referred to as the
computation unit time throughout our discussion. With these
assumptions  and  using  the  above  mentioned  representation  for  the 
matrix  structure,  the  partitioning  algorithms  for  multiplication 
can  be  defined. 


1. If the number of values of the result matrix, t = mp,
   is equal to the number of arithmetic processors, k, or
   is a submultiple of k, then the allocation of [A] and
   [B] values is done as shown.

   Theorem: One row of values of [A] and one column of values
            of [B] are allocated to each processor, according
            to the following theorem.

            For r=1 to m
            {  for c=1 to p
               {  ((r-1)p+c)th processor is allocated the
                  rth row of values of [A] and the cth
                  column of values of [B] to compute the
                  value C[r,c] of the result matrix
               }  /* for c */
            }  /* for r */
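
As a rough illustration of case 1, the following Python sketch
builds the processor-to-element map and then lets each simulated
processor form its inner product independently. The function
names are hypothetical and the processors are simulated by an
ordinary loop.

```python
def case1_allocation(m, p):
    """Map processor number -> (row r of A, column c of B), all 1-based."""
    return {(r - 1) * p + c: (r, c)
            for r in range(1, m + 1)
            for c in range(1, p + 1)}


def multiply_case1(A, B):
    """Compute C = A * B with one result element per simulated processor."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for _proc, (r, c) in case1_allocation(m, p).items():
        # The processor holds row r of A and column c of B, so it needs
        # no data from any other processor.
        C[r - 1][c - 1] = sum(A[r - 1][x] * B[x][c - 1] for x in range(n))
    return C
```

For a 2 x 2 product, processor 3 is given row 2 of [A] and
column 1 of [B].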

2. a. If the number of columns of the product matrix, p,
      equals k, then the allocation is done as follows.

   Theorem: The entire matrix [A] and one column of values of
            [B] are allocated to each processor.

   b. If p is equal to a multiple of k, then the allocation
      is done as follows.

   Theorem: The entire matrix [A] and (p/k) columns of values
            of [B], in order, are allocated to each processor.

3. a. If the number of rows of matrix [A], m, equals k,
      then the allocation of input matrix values is done
      according to the following theorem.

   Theorem: One row of values of [A] and the entire matrix
            [B] are allocated to each processor.

   b. If m is a multiple of k, then the allocation of
      [A] and [B] matrix values is done according to the
      following theorem.

   Theorem: (m/k) rows of values of matrix [A] and the entire
            [B] matrix are allocated to each processor.
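
Case 3b can be pictured with the short Python sketch below. It
assumes that "in order" means consecutive rows, and the function
names are hypothetical.

```python
def row_blocks(m, k):
    """Split row indices 0..m-1 into k consecutive blocks of m/k rows."""
    if m % k != 0:
        raise ValueError("m must be a multiple of k")
    size = m // k
    return [list(range(s * size, (s + 1) * size)) for s in range(k)]


def multiply_by_row_blocks(A, B, k):
    """Each simulated processor holds (m/k) rows of A and all of B."""
    n, p = len(B), len(B[0])
    C = [None] * len(A)
    for rows in row_blocks(len(A), k):      # one block per processor
        for r in rows:
            C[r] = [sum(A[r][x] * B[x][c] for x in range(n))
                    for c in range(p)]
    return C
```

Case 2b is the mirror image: the blocks are taken over the
columns of [B] while every processor holds all of [A].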

4. If t = mp is a multiple of k, t = jk, and

   a. if m<p, i.e., m is a submultiple of k, k = qm, then the
      allocation of [A] and [B] values is done as follows.

   Theorem: For r=1 to m
            {  for c=1 to q
               {  ((r-1)q+c)th processor is assigned the rth row
                  of values of [A] and the cth string of j
                  columns of [B] in order.
               }  /* for c */
            }  /* for r */

   b. If m>p, i.e., p is a submultiple of k such that k = rp,
      then the result matrix [C] is computed rowwise
      according to the following theorem.

   Theorem: For q=1 to p
            {  for c=1 to r
               {  ((q-1)r+c)th processor is assigned the
                  cth string of j rows of values of [A]
                  in order and the qth column of values of
                  [B]
               }  /* for c */
            }  /* for q */
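
A small Python sketch of case 4a is given here as a check on the
index arithmetic. The function name is hypothetical; a "string of
j columns" is taken to mean j consecutive columns.

```python
def case4a_allocation(m, p, k):
    """Map processor -> (row of A, list of columns of B), all 1-based.

    Requires t = m*p = j*k and k = q*m, as in case 4a.
    """
    t = m * p
    if t % k != 0 or k % m != 0:
        raise ValueError("need mp to be a multiple of k and k a multiple of m")
    j, q = t // k, k // m
    alloc = {}
    for r in range(1, m + 1):
        for c in range(1, q + 1):
            cols = list(range((c - 1) * j + 1, c * j + 1))  # c-th string
            alloc[(r - 1) * q + c] = (r, cols)
    return alloc
```

With m = 2, p = 6 and k = 4, each processor computes j = 3
elements of one row of [C], and the q = 2 strings of columns
together cover all p = jq columns of [B].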

5. If t = mp is > k, but not related to k, and

   a. if the number of columns of [A], n, is equal to k,
      then the allocation of [A] and [B] values is done
      according to the theorem presented next. Note
      that interprocessor communication is required for the
      addition of the partial products generated in each of
      these processors, to give a final product element of [C].

   Theorem: One column of values of [A] and one row of
            values of [B], in order, are allocated to each
            processor. Each processor calculates each of
            the n or k partial products by using the column
            of values of [A] and the relevant row of values of
            [B] stored in it. The final product is
            obtained by adding the n partial products by
            means of unit shifts between the processors
            followed by parallel additions of these
            partial products.
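
The shift-and-add scheme of case 5a might be simulated as in the
following Python sketch. It is a sketch under assumptions: the
shifts are modelled as forward circular shifts of a buffer, every
processor accumulates what it receives, and the function name is
hypothetical.

```python
def multiply_shift_add(A, B):
    """C = A * B where processor s holds column s of A and row s of B."""
    m, k, p = len(A), len(B), len(B[0])
    # Partial products formed locally on processor s.
    partial = [[[A[i][s] * B[s][j] for j in range(p)] for i in range(m)]
               for s in range(k)]
    acc = [[row[:] for row in part] for part in partial]
    buf = partial
    for _ in range(k - 1):
        # Unit shift: processor s receives the buffer of processor s-1.
        buf = [buf[(s - 1) % k] for s in range(k)]
        # Parallel addition on every processor.
        for s in range(k):
            for i in range(m):
                for j in range(p):
                    acc[s][i][j] += buf[s][i][j]
    return acc[0]   # after k-1 shifts every processor holds the full sum
```

After k-1 unit shifts each processor has seen the partial
products of all the others, so any one of them holds [C].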

   b. If n is a multiple of k, the allocation of [A]
      and [B] values is done according to the following
      theorem. (Note that interprocessor communication
      takes place through the cyclic shift interconnection
      provided in the system architecture.)

   Theorem: (n/k) columns of [A] and (n/k) rows of [B],
            in order, are allocated to each processor.
   c. If n is a submultiple of k and,

      i. the number of elements in [A], (mn), is a
         multiple of k, then the allocation of [A] and
         [B] values is done according to the following
         theorem.

   Theorem: For r=1 to n
            {  for l=0 to (k/n-1)
               {  (ln+r)th processor is assigned the (l+1)th
                  string of (mn/k) values of column r of [A]
                  and the rth row of values of [B]
               }  /* for l */
            }  /* for r */

      ii. if the number of elements in [B], (np), is a
          multiple of k, then the allocation of [A] and [B]
          values is done as shown below.

   Theorem: For c=1 to n
            {  for l=0 to (k/n-1)
               {  (ln+c)th processor is assigned the cth
                  column of values of [A] and the (l+1)th
                  string of (np/k) values of row c of [B]
               }  /* for l */
            }  /* for c */

   d. If n is a submultiple of k, but the numbers of
      elements in [A] and in [B] are not multiples
      of k, and

      i. m>=p, then letting j = ⌊mn/k⌋ and i = jk/n, the
         allocation of [A] and [B] values is done
         according to the following theorem.

   Theorem: For c=1 to n
            {  for l=0 to (k/n-1)
               {  (ln+c)th processor is assigned the
                  (l+1)th string of (in/k) values of
                  column c of [A] and the cth row of values
                  of [B]
               }  /* for l */
            }  /* for c */

            Also, the additional values of [A] are allocated
            such that,

            for r=(i+1) to m
            {  for c=1 to n
               {  ((r-(i+1))n+c)th processor is
                  allocated the cth column value of the
                  rth row of [A]
               }  /* for c */
            }  /* for r */

            Each final product is obtained by shifting and
            adding the n partial products between the string
            of n processors. For synchronization purposes,
            the i rows of [C] are first evaluated column
            wise, after which the remaining (m-i) rows of
            [C] are computed.

      ii. if m<p, then letting j = ⌊np/k⌋ and i = jk/n, the
          allocation of [A] and [B] values is done
          according to the following theorem.

   Theorem: For c=1 to n
            {  for l=0 to (k/n-1)
               {  (ln+c)th processor is assigned the
                  cth column of [A] and the (l+1)th string
                  of (in/k) values of row c of [B]
               }  /* for l */
            }  /* for c */

            Also, the additional values of [B] are allocated
            such that,

            for c=(i+1) to p
            {  for r=1 to n
               {  ((c-(i+1))n+r)th processor is allocated
                  the rth row value of the cth column of [B]
               }  /* for r */
            }  /* for c */

            Each final product is obtained by shifting and
            adding the n partial products between the string
            of n processors. For synchronization purposes,
            the i columns of [C] are first evaluated, followed by
            the evaluation of the remaining (p-i) columns of [C].
   e. If m is a submultiple of k, then letting j = ⌊mp/k⌋
      and i = (p-jk/m), the allocation of [A] and [B]
      values is done according to the following theorem.

   Theorem: For s=1 to i
            {  sth string of m processors is allocated
               the m rows in order of [A] and,
               for c=0 to j
               {  (ck/m+s)th column of values of [B]
               }  /* for c */
            }  /* for s */

            In addition, the following allocation is made,

            for s=(i+1) to k/m (if s<p)
            {  sth string of m processors is allocated
               the m rows of [A] in order and,
               for c=0 to (j-1)
               {  (ck/m+s)th column of values of [B]
               }  /* for c */
            }  /* for s */

            Note that interprocessor communication
            is not required in this scheduling
            algorithm.
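
The column distribution of case 5e can be checked with a short
Python sketch. The function name is hypothetical, and j is
assumed to be the integer part of mp/k, which makes i the number
of strings that receive one extra column.

```python
def case5e_columns(m, p, k):
    """Map string number s -> columns of B given to that string (1-based).

    A "string" is a group of m processors holding the m rows of A.
    """
    if k % m != 0:
        raise ValueError("m must be a submultiple of k")
    strings = k // m              # number of strings, k/m
    j = (m * p) // k              # assumed integer part of mp/k
    i = p - j * strings           # strings that take one extra column
    alloc = {}
    for s in range(1, strings + 1):
        last = j if s <= i else j - 1
        alloc[s] = [c * strings + s for c in range(0, last + 1)]
    return alloc
```

With m = 2, k = 6 and p = 7 there are three strings; the first
receives columns 1, 4 and 7, the second columns 2 and 5, and the
third columns 3 and 6. Case 5f follows the same pattern with the
roles of rows and columns exchanged.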

   f. If p is a submultiple of k, then letting j = ⌊mp/k⌋
      and i = (m-jk/p), the allocation of [A] and [B]
      values can be done according to the following
      theorem.

   Theorem: For s=1 to i
            {  sth string of p processors is allocated
               the p columns of [B] in order, and
               for r=0 to j
               {  (rk/p+s)th row of values of [A]
               }  /* for r */
            }  /* for s */

            In addition, the following allocation is made:

            for s=(i+1) to k/p (if s<m)
            {  sth string of p processors is allocated the p
               columns in order of [B] and,
               for r=0 to (j-1)
               {  (rk/p+s)th row of values of [A]
               }  /* for r */
            }  /* for s */

            Note that no interprocessor communication
            is required for this scheduling algorithm.

6. If the values of m, n, and p are totally
   arbitrary, defying any trace of parallelism,
   then letting t = (mp), q = ⌊t/k⌋ and r = (qk), the
   allocation of [A] and [B] values is done by
   first establishing a linear relationship
   between the individual values of the product
   matrix [C] such that the element C[i,j]
   occupies the (pi+j-p)th location
   in the linear list. With this assumption,
   the allocation of [A] and [B] values among
   the processors can be done according to
   the following theorem.

   Theorem: For s=1 to (t-r)  /* parallel sequence */
            {  for l=0, and if (t>k) then,
               for l=1 to q   /* serial sequence */
               {  sth processor computes the (lk+s)th
                  value in the linear list of [C]
               }  /* for l */
            }  /* for s */

            If (mp)>k, then

            for s=(t-r+1) to k  /* parallel sequence */
            {  for l=0 to (q-1)  /* serial sequence */
               {  sth processor computes the (lk+s)th
                  value in the linear list of [C]
               }  /* for l */
            }  /* for s */
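
For case 6, the Python sketch below maps each position in the
linear list of [C] back to its row and column and lets each
simulated processor work through its own positions. The function
names are hypothetical. Stepping through the positions s, k+s,
2k+s, ... up to t visits the same positions as the two loops of
the theorem.

```python
def position_to_index(pos, p):
    """Inverse of pos = p*i + j - p: return 1-based (i, j)."""
    return (pos - 1) // p + 1, (pos - 1) % p + 1


def multiply_case6(A, B, k):
    """C = A * B for arbitrary m, n, p on k simulated processors."""
    m, n, p = len(A), len(B), len(B[0])
    t = m * p
    C = [[0] * p for _ in range(m)]
    for s in range(1, k + 1):                # parallel sequence
        for pos in range(s, t + 1, k):       # serial sequence on processor s
            i, j = position_to_index(pos, p)
            C[i - 1][j - 1] = sum(A[i - 1][x] * B[x][j - 1] for x in range(n))
    return C
```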

3.3  SUMMARY 

The preceding theorems are the formulae for partitioning
matrices efficiently for the speedy performance of the basic
operations of addition, subtraction, and scalar and matrix
multiplication. Some algorithms have also been developed for
matrix inversion, but they are not within the scope of
this report. In fact, this report deals explicitly with
verifying the partitioning algorithms for matrix multiplication.
However, the emulator can presently handle matrix addition,
subtraction, and scalar multiplication. For that reason, they
were included in this section. Martin and Sadhasivan [10,11,12]
have researched other relevant issues that are raised by these
algorithms. Their findings have been published, and may be
referred to in order to enhance one’s knowledge of this subject.

It is hoped that the reader will be able to follow the
aforementioned theorems, coupled with the instructions for
programming the MJH1, to successfully partition and execute a
basic matrix computation via the MJH1 emulator.


CHAPTER  FOUR 


THE MARTIN-JONES-HUGHES VERSION ONE

4.0  INTRODUCTION 

The Martin-Jones-Hughes Version One, as previously stated,
is an emulated multiprocessor designed specifically to perform
matrix computations. A software description of the MJH1 is
written in the structured programming language Pascal. The
partitioning algorithms of Chapter Three were developed
exclusively for mapping the matrix operations of addition,
subtraction, scalar multiplication, and matrix multiplication
onto a multiple processor architecture such as the MJH1. These
algorithms have been shown to provide a much faster execution
time when implemented on an appropriate multiple processor
system. The MJH1 simply verifies the partitioning algorithms and
allows for the intuitive evaluation of computational times
associated with each algorithm.

The  specifics  of  the  MJH1  will  be  discussed  at  length  in  the 
subsequent  sections  of  this  chapter. 

4.1 SYSTEM ARCHITECTURE

The MJH1 is classified as a Single Instruction-Stream
Multiple Data-Stream (SIMD) machine. It has the following
attributes:

-a single controller

-an array of 8 processing elements

-an Input/Output scheme

-central memory


In  the  subsections  that  follow,  the  components  of  the 
multiprocessor  will  be  thoroughly  discussed. 

4.2  MAJOR  COMPONENTS  OF  THE  MULTIPLE  PROCESSOR  ARCHITECTURE 

A block diagram of the system architecture is given in
Figure 4.1. A single control processor is connected to a maximum
of eight processing elements (PE’s). A data bus is accessible to
all eight PE’s via a status bit that is associated with each PE
in the system configuration. The bits are numbered from bit zero
to bit seven, with the most significant bit of the status vector
referring to PE 0 and the least significant bit of the status
vector referring to the last (highest numbered) PE in the
configuration. Instead of an address bus carrying the addresses
of specific PE’s, a PE is identified, or activated, when its
status bit is turned on (set to a logical 1). The PE’s are
capable of reading from and writing to the central memory.
Overall system I/O is done via the central memory. That is,
individual processing elements do not have access to the outside
world. They communicate with central memory, which in turn
interacts with the outside world. Operand matrices are read into
the PE’s, the operations are performed, and the results are
returned to the central memory of the control unit.
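
The status-vector convention can be illustrated with the Python
sketch below. The function name and the use of an integer to hold
the eight status bits are assumptions made only for this example.

```python
NUM_PES = 8


def active_pes(status):
    """Return the PE numbers whose status bit is 1.

    status is an integer 0..255; the most significant of the eight
    bits refers to PE 0 and the least significant to PE 7.
    """
    return [pe for pe in range(NUM_PES)
            if (status >> (NUM_PES - 1 - pe)) & 1]
```

For example, the status vector 10100000 activates PE 0 and PE 2,
while 00000001 activates only PE 7.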


BLOCK DIAGRAM OF MULTIPROCESSOR ARCHITECTURE

4.2.1 The Controller

The controller is responsible for initializing all PE’s at
runtime. Initializing processing elements involves zeroing out
all memory in the PE’s and calculating their shift neighbors
(should a shift function need to be performed). The controller
is a general purpose computer with central memory and
input/output capability.

The control processor is also responsible for program
execution. A typical instruction execution proceeds in the
following manner. The control processor fetches an instruction
from central memory and broadcasts it to all PE’s. Only the
enabled PE’s will be affected by the broadcast instruction.
The instruction is then decoded and executed by the appropriate
PE’s. The memory address register of the controller is then
incremented and the next instruction is fetched, broadcast, and
executed. The cycle continues until a CPHALT instruction is
invoked.
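
The fetch-broadcast-execute cycle might be pictured with the toy
Python loop below. It is a sketch, not the emulator's code: the
opcodes ENABLE (100) and CPHALT (255) are those of the MJH1, but
the encoding of the ENABLE mask as an integer in the first
operand and the log of (PE, opcode) pairs are assumptions made
for the illustration.

```python
ENABLE, CPHALT = 100, 255


def run(program, num_pes=8):
    """Run a list of (opcode, op1, op2, op3) tuples; return a log of
    (pe, opcode) pairs showing which PE's received each PE instruction."""
    enabled = [True] * num_pes
    log = []
    mar = 0                                   # controller address register
    while True:
        opcode, op1, _op2, _op3 = program[mar]    # fetch
        mar += 1
        if opcode == CPHALT:
            break
        if opcode == ENABLE:
            # Assumed encoding: op1 is a bit mask, MSB refers to PE 0.
            enabled = [bool((op1 >> (num_pes - 1 - pe)) & 1)
                       for pe in range(num_pes)]
        elif opcode < 100:
            # PE instruction: broadcast, but only enabled PE's act on it.
            for pe in range(num_pes):
                if enabled[pe]:
                    log.append((pe, opcode))
    return log
```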

4.2.2  The  Processing  Element 

All processing elements are identical. A
typical PE has an ALU of its own that can perform addition,
subtraction, division, scalar multiplication, and matrix
multiplication. A typical PE has its own memory, which is
referred to as local memory. This local memory is partitioned
into three sections: memory A (MA), memory B (MB), and memory R
(MR), where A, B, and R refer to the operand matrices (A and B)
and the result matrix R. Also, each enabled PE executes
instructions that are broadcast to it by the controller. This
execution is referred to as concurrent or parallel processing. A
block diagram of the PE is shown in Figure 4.2.

4.2.3 I/O SCHEME

The Input/Output (I/O) of data is performed by the control
processor. Data can either be written into central memory by
various active processing elements, or it can be accessed by
those same active processors. That is, each active processing
element can write to as well as read from central memory. This
data travels along a common bidirectional data bus.

Along with reading and writing matrices, each PE is capable
of sending data to and receiving data from neighboring PE’s
via a shift network. The shift network consists of a shift port
(SPORT) and a shift memory. This shift memory (MS) is also
located within the local memory of each PE.
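
As an illustration of the shift neighbors a PE might use, the
Python sketch below gives the conventional definitions of the
forward circular, backward circular, and perfect shuffle
connections for N processors. These are textbook definitions and
are an assumption here; the emulator's own neighbor calculation
is in Appendix A and may differ in detail.

```python
def forward_neighbor(pe, n=8):
    """PE that receives data from pe on a forward circular shift."""
    return (pe + 1) % n


def backward_neighbor(pe, n=8):
    """PE that receives data from pe on a backward circular shift."""
    return (pe - 1) % n


def shuffle_neighbor(pe, n=8):
    """Perfect shuffle: rotate the binary PE number left by one bit
    (n is assumed to be a power of two)."""
    if pe == n - 1:
        return pe
    return (2 * pe) % (n - 1)
```

With eight PE’s, the perfect shuffle sends PE’s 0 through 7 to
PE’s 0, 2, 4, 6, 1, 3, 5, 7 respectively.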

4.2.4  CENTRAL  MEMORY 


The central memory (CM) contains the instruction sequence
needed to perform a particular computation. Also contained in
central memory are the operand matrices, A and B. After the
result matrix, R, has been generated by the active PE’s, it,
too, is stored into central memory, where it waits to be
accessed and written out to the outside world.

4.3  THE  MJH1  EMULATOR 


The MJH1 Emulator is a software tool that emulates the
multiprocessor architecture that has been previously described.
This emulator was first written by E. Jones [3], and later
modified by the author of this thesis. The emulator is written
in Pascal mainly because it is a structured programming language
and is more conducive to describing the various modules that are
necessary in defining this SIMD architecture in software. Also,
Pascal is a fairly simple language. A program listing of the
emulator is included in Appendix A; however, the major components
of the emulator will be discussed here.

Since the MJH1 emulates a machine that is capable of
executing instructions to perform the partitioning of matrices,
it is, in essence, a simulated computer. Thus, in order to show
that this emulator is indeed practical, a discussion of the type
of instructions that must be included in a practical computer
follows.

A computer should have a set of instructions that allows the
user to formulate any conceivable data processing task. To
ensure this, the computer (or in this case, the emulator) must
include a sufficient number of instructions in each of the
following categories:

1.  Arithmetic,  logical,  and  shift  instructions. 

2. Instructions for moving information to and from memory and
processor registers.

3.  Instructions  that  check  status  information  to  provide 
decision  making  capabilities. 

4.  Input  and  output  instructions. 

5.  The  capability  of  stopping  the  computer. 

Since  the  MJH1  provides  instructions  from  each  category 
listed,  it  is  a  functionally  complete  machine  [13]. 


4.3.1  MAJOR  COMPONENTS  OF  THE  MJH1 


The MJH1 is composed of several modules. One such module
defines the global constants that are used in defining the actual
configuration of active processors. A definition of the
instruction set comprises another module of the emulator. Next,
a declaration of the data types used in defining the variables of
the emulator is included. Variables required to successfully
run the emulator are contained in yet another module. The module
which has the most impact, however, is the processor module.
This module contains code for executing the overall instruction
set, which consists of both control processor and processing
element instructions. This module also contains the Pascal code
for processor initialization, as well as for processor-state
dumping into the trace file. These modules will be further
discussed in the subsections that follow.

4.3.1.1 GLOBAL CONSTANTS

Table  4.1  contains  a  listing  of  the  global  constants  that 
are  used  throughout  the  emulator.  These  constants  define  the 
actual  configuration  of  each  active  processing  element. 


TABLE 4.1


(* Global Constants: These define the actual configuration *)
const
   MAXNP      = 8;       { Max number of arithmetic processors.    }
   NPM1       = 7;       { Number of processors minus one.         }
   MEMMAX     = 256;     { Size of each AP local memory.           }
   CPMEMMAX   = 2048;    { Size of central memory (CM).            }
   SHREGMAX   = 1;       { Size of shift-register memory.          }
   MAXDATAVAL = 25600;   { Hypothetical overflow value.            }

   zeroval    = 0;       { Constant for integer data type.         }
   trueval    = '1';     { Char representation of logical "true".  }
   falseval   = '0';     { Char repr'n of logical "false".         }


Overall  central  memory  of  the  multiprocessor  architecture  is 
simulated  here,  along  with  the  individual  processors. 

4.3.1.2 THE INSTRUCTION SET

The MJH1 has an instruction set of forty instructions. These
instructions are of two types: Processing Element (PE)
instructions and Control Processor (CP) instructions. Two
instruction sets are necessary due to the nature of the
multiprocessor architecture that is being emulated. The
instruction set for the MJH1 is summarized in Table 4.2. Note
that PE instructions have opcodes of at most two digits, while
CP instructions have three-digit opcodes.


TABLE 4.2


(* PE INSTRUCTION SET MNEMONICS AND OPCODES. *)

   { MISCELLANEOUS INSTRUCTIONS. }
   NOOP   = 0;     { No Op: Allows documentation of code.  }

   { ADDRESSING INSTRUCTIONS. }
   ADBASE = 10;    { Advance LM base register.             }
   SETB   = 11;    { Set LM base register to vector origin.}
   ALLOC  = 12;    { Allocate c more words in LM.          }
   SETMA  = 15;    { Set MAR to address in Ql.             }
   ADMA   = 16;    { Advance MAR by specified value.       }

   { CM ACCESS and SHIFT INSTRUCTIONS. }
   LOADCM = 20;    { Load into LM from CM.                 }
   STORCM = 21;    { Store into CM from LM.                }
   FCSHF  = 22;    { Forward circular shift.               }
   BCSHF  = 23;    { Backward circular shift.              }
   PSSHF  = 24;    { Perfect shuffle shift.                }
   SHFIN  = 25;    { Receive shifted data.                 }
   SHIFX  = 26;    { Put data on shift port.               }

   { DATA MOVEMENT INSTRUCTIONS. }
   MOVA   = 30;    { Move data to MA from another LM.      }
   MOVB   = 31;    { Move data to MB from another LM.      }
   MOVR   = 32;    { Move data to MR from another LM.      }
   MOVS   = 33;    { Move data to MS from another LM.      }

   { SCALAR ARITHMETIC INSTRUCTIONS. }
   SADD   = 40;    { Add scalar values in MA and MB.       }
   SSUB   = 41;    { Subtract scalar values in MA and MB.  }
   SMPY   = 42;    { Multiply scalar values in MA and MB.  }
   SDIV   = 43;    { Divide scalar values in MA and MB.    }
   SMSA   = 44;    { Add MS[i] to a LM.                    }

   { VECTOR-SCALAR ARITHMETIC INSTRUCTIONS. }
   { NOTE: Vector in MA, Scalar in MB. }
   VSADD  = 50;    { Add scalar to vector.                 }
   VSSUB  = 51;    { Subtract scalar from vector.          }
   VSMPY  = 52;    { Multiply vector by scalar.            }
   VSDIV  = 53;    { Divide vector by scalar.              }

   { VECTOR-VECTOR ARITHMETIC INSTRUCTIONS. }
   PVADD  = 60;    { Pair-wise vector addition.            }
   PVSUB  = 61;    { Pair-wise vector subtraction.         }
   PVMPY  = 62;    { Pair-wise vector multiplication.      }
   INPRD  = 65;    { Vector inner product.                 }

   { STATUS-CHECKING INSTRUCTIONS. }
   TST    = 70;    { Test statusword for error condition.  }

(* CP INSTRUCTION SET MNEMONICS AND OPCODES. *)

   { CONTROL INSTRUCTIONS. }
   ENABLE = 100;   { Enable processors specified by mask.  }
   PUSHM  = 101;   { Push current active mask onto MSTACK. }
   PULLM  = 102;   { Restore (pull) mask from MSTACK.      }
   SETT   = 103;   { Set MJH1 TRACE level.                 }
   CPHALT = 255;   { Shut down Control Processor.          }

   { I/O INSTRUCTIONS. }
   MREAD  = 110;   { Read M x N matrix into CM.            }
   MREADB = 113;   { Read M x N matrix [B] into CM.        }
   MWRITE = 111;   { Write M x N matrix stored in CM.      }

   { DATA MANAGEMENT INSTRUCTIONS. }
   SETV   = 120;   { Set range of CM locations to value.   }


Each instruction consists of an opcode followed by three
operands. An illustration of the instruction format is shown
below in Figure 4.3.


      OPCODE   OP1   OP2   OP3


FIGURE 4.3

These operands, in general, specify addressing modes. For
example, the following symbols can be used for operand one,
depending upon the particular instruction desired.

   m  := local memory specifier
            1 => MA
            2 => MB
            3 => MR
            4 => MS
   v  := an immediate value
   c  := a count
   i  := an index, or immediate value
   p  := a specific processing element
   k  := # of column values for matrix multiplication
   IA := index for MA
   t  := trace level
            0 => no trace
            1 => processor state only
            2 => full processor state

Symbols for operand two are similar to those of operand one.
However, not all symbols used for operand one are used for
operand two. A listing of these symbols follows.

   i  := index or immediate value
   v  := an immediate value
   c  := a count
   p  := PE number
   m  := column dimension of a matrix
   IB := index for MB

There is only one specified symbol for operand three. Despite
this fact, an operand must be supplied in each field for each
instruction that is to be executed. For that reason, zeros must
be supplied in a particular field when no operand is required.
The defined symbol for operand three follows.

   m  := column dimension of a matrix

Consider, for example, the instruction 12 2 2 0. This
instruction says to allocate two spaces in local memory B.
Operands one and two correspond to m and c, and operand three
is a zero because it is not needed. The programmer is referred
to Appendix A, more specifically to pages A-10 through A-14 and
A-16, for the correct placement of these symbols in a specific
instruction.

The emulator has the capability of performing both vector
and scalar operations. Vector operations are those that involve
streaming data to one processor (serially), then to the next, and
so on. Scalar operations involve interleaving or performing
operations concurrently on all processors. Upon
examining the instruction set of the MJH1, the user will be able
to distinguish the vector instructions from the scalar
instructions by their groupings.
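
The instruction format of Figure 4.3 can be decoded with a few
lines of Python, as sketched below. The opcode values come from
Table 4.2 (only a handful are listed); the function names and the
textual form of an instruction as four blank-separated integers
are assumptions made for the illustration.

```python
MNEMONICS = {0: "NOOP", 12: "ALLOC", 20: "LOADCM", 21: "STORCM",
             100: "ENABLE", 255: "CPHALT"}     # subset of Table 4.2
LOCAL_MEMORY = {1: "MA", 2: "MB", 3: "MR", 4: "MS"}


def decode(line):
    """Split 'opcode op1 op2 op3' into its fields."""
    opcode, op1, op2, op3 = (int(field) for field in line.split())
    return {
        "mnemonic": MNEMONICS.get(opcode, "?"),
        "kind": "CP" if opcode >= 100 else "PE",   # three-digit => CP
        "operands": (op1, op2, op3),
    }


def describe_alloc(line):
    """Describe an ALLOC instruction: op1 is m (memory), op2 is c (count)."""
    fields = decode(line)
    if fields["mnemonic"] != "ALLOC":
        raise ValueError("not an ALLOC instruction")
    m, c, _unused = fields["operands"]
    return "allocate %d words in %s" % (c, LOCAL_MEMORY[m])
```

Applied to the instruction 12 2 2 0, describe_alloc reports an
allocation of two words in MB.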



4.3.1.3 DATA TYPES


A listing of the various types of data used in the emulator
is shown in Table 4.3. These "type" statements are similar to
the declaration statements used in Fortran.

TABLE 4.3
DATA TYPES


type
   dataval     = integer;                 { Data type of array         }
                                          { data values.               }
   matrixdim   = 0..31;                   { Max rows / cols.           }
   matrix      = array[matrixdim,matrixdim]
                    of dataval;

   PEmask      = record                   { Bit string indicating      }
                    bit: array[0..NPM1]   { active PEs.                }
                       of boolean;
                 end;
   maskstack   = array[1..10]             { Stack of active PE masks.  }
                    of PEmask;
   maskstr     = array[0..NPM1]           { String version of PE mask. }
                    of char;

   statusword  = array[0..3]              { 4-bit Status word.         }
                    of boolean;           { See Below.                 }
      { bit zero: 1 = enabled/noncompletion.             }
      { bits 1-2: 00 = no exception.                     }
      {           01 = arithmetic exception:             }
      {                bit 3: 0 = zero divide;           }
      {           10 = machine exception.                }
      {           11 = operand addressing exception.     }
      {                bit 3: 0 = address range;         }
   statustr    = array[0..3]              { String version of status word. }
                    of char;

   datamem     = array[0..MEMMAX]         { Local A, B, R memories.    }
                    of dataval;
   shiftmem    = array[0..SHREGMAX]       { Shift register memory.     }
                    of dataval;
   cpmem       = array[0..CPMEMMAX]       { The Central Memory.        }
                    of dataval;

   instruction = record
                    opcode: integer;      { Operation code.            }
                    op1: integer;         { First operand.             }
                    op2: integer;         { Second operand.            }
                    op3: integer;         { Third operand.             }
                 end;
   opcodeset   = set of 0..CPHALT;        { Valid processor opcodes.   }

   processor   = record
                    PROCID: integer;      { Processor ID.              }
                    ACC: dataval;         { Accumulator.               }
                    MAR: integer;         { Memory address reg.        }

                                          { Processor ID for shifting. }
                    FCSID: integer;       { Forward circular.          }
                    BCSID: integer;       { Backward circular.         }
                    PSSID: integer;       { Perfect Shuffle.           }

                                          { Local Memories.            }
                    MA: datamem;          { A-operand Memory.          }
                    MB: datamem;          { B-operand Memory.          }
                    MR: datamem;          { Result Memory.             }
                    MS: shiftmem;         { Shift registers.           }
                    SPORT: shiftmem;      { Inter-processor shift port. }

                                          { Base Registers for Local Mem's }
                    MAB: integer;         { Base register for MA.      }
                    MBB: integer;         { Base register for MB.      }
                    MRB: integer;         { Base register for MR.      }

                                          { Current Bounds-Reg for LM's. }
                    MAH: integer;         { Hi in-use MA address.      }
                    MBH: integer;         { Hi in-use MB address.      }
                    MRH: integer;         { Hi in-use MR address.      }

                    STATUS: statusword;   { condition code bits.       }
                 end;  { processor record. }
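
The 4-bit status word described in the comments of Table 4.3 can
be interpreted as in the following Python sketch. The function
name and the convention that "01" means bit 1 = 0 and bit 2 = 1
are assumptions made for this illustration.

```python
EXCEPTIONS = {
    (0, 0): "no exception",
    (0, 1): "arithmetic exception",
    (1, 0): "machine exception",
    (1, 1): "operand addressing exception",
}


def decode_status(bits):
    """Interpret a status word given as four 0/1 values, bit zero first."""
    if len(bits) != 4:
        raise ValueError("a status word has exactly four bits")
    return {
        "enabled": bool(bits[0]),                   # bit zero
        "exception": EXCEPTIONS[(bits[1], bits[2])],  # bits 1-2
        "detail_bit": bits[3],                      # bit 3 qualifies the exception
    }
```

For instance, the status word 1 0 1 0 describes an enabled PE
with an arithmetic exception whose bit 3 is 0 (zero divide).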

4.3.1.4 REQUIRED VARIABLES


Table 4.4 contains a listing of the variables required to
run the emulator. Some of these variables must be supplied by
the programmer. These variables are the four files (code,
operand, trace, and result), the number of processors in the
multiprocessing configuration, and the level of trace desired.
The other variables listed, unless contained in the files
supplied by the user, are generated by the emulator. These
entities will be further discussed later.

TABLE 4.4
REQUIRED VARIABLES


var
   codefile: text;         { File containing MJH1 code along    }
                           { with immediate data.               }
   oprndfile: text;        { File containing matrix values.     }
   tracefile: text;        { File to contain full processor     }
                           { execution trace.                   }
   resultfile: text;       { File to contain "result" (i.e.,    }
                           { data transmitted by "SEND"         }
                           { instructions).                     }
   TRACE: integer;         { Trace enable flag.                 }

   IR: instruction;        { Broadcasted instr. register.       }
   PE: array[0..NPM1]      { The MJH1 network of PE's.          }
          of processor;
   NP: integer;            { Number of processors in use.       }
   NAP: integer;           { Number of currently active PEs.    }
   A, B, R: matrix;        { Operand and result matrices.       }
   CM: cpmem;              { The CP Central Memory.             }
   CMHI: integer;          { Highest in-use CM address.         }
   ACTIVEMASK: PEmask;     { Current active processor mask.     }
   MSTACK: maskstack;      { Stack of active PE masks.          }
   MSTKTOP: integer;       { Stack pointer for MSTACK.          }
   COMPLETED: boolean;     { Flag set when active PEs           }
                           { complete instruction.              }
   PEOPCODES: opcodeset;   { Set of valid PE opcodes.           }
   CPOPCODES: opcodeset;   { Set of valid CP opcodes.           }
   i: integer;             { Work variable.                     }
   P: integer;

4.3.1.5  PROCESSOR MODULE


This module is analogous to the control processor of the
multiprocessor architecture in that it is the heart of the
emulator. Like the controller, the processor module is
responsible for clearing all processor memories and identifying
all shift neighbors for each processing element configured in the
multiprocessing network. Along with this initialization, this
module also performs all instruction execution. That includes
the I/O scheme, since there are specific instructions for
reading and writing data among processors, central memory, and
the outside world. This function is also compatible with that of
the controller, in that the controller is responsible for

Instruction execution proceeds

initialization, the memory address register (MAR) of the
controller is set to negative one (-1). This is done so that
once the MAR is incremented, it will point to the current
instruction. The MAR is incremented at the end of each
instruction decode and execution cycle. The current instruction
is fetched, and then compared first to the Processing Element
opcodes, and then to the Control Processor opcodes, in the event
that the instruction was not a PE opcode. Once a match has been
made and the appropriate instruction has been identified, it
is decoded as per the detailed lines of code included in the
emulator. The MAR is incremented and the next instruction is
fetched, decoded, and executed. This process continues until a
CPHALT instruction is issued.
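
The cycle described above can be pictured with a minimal Python sketch. It is
an illustration only, not the emulator's Pascal code: the function name run,
the log format, and the exact membership of the two opcode sets (built here
from the opcode constants listed in Appendix A) are assumptions, and no
instruction is actually executed.

```python
CPHALT = 255  # opcode that shuts down the control processor

# Assumed opcode sets, assembled from the constants listed in Appendix A.
PE_OPCODES = {0, 10, 11, 12, 15, 16, 20, 21, 22, 23, 24, 25, 26,
              30, 31, 32, 33, 40, 41, 42, 43, 44, 50, 51, 52, 53,
              60, 61, 62, 65, 70}
CP_OPCODES = {100, 101, 102, 103, 110, 111, 113, 120, CPHALT}


def run(program):
    """Walk a list of (opcode, op1, op2, op3) tuples until CPHALT.

    Returns the dispatch log as (address, kind) pairs.
    Nothing is executed; only the fetch and classification steps are shown.
    """
    mar = -1  # set to -1 so the first increment points at instruction 0
    log = []
    while True:
        mar += 1
        if mar >= len(program):
            raise RuntimeError("ran off the end of the code without CPHALT")
        opcode = program[mar][0]
        if opcode in PE_OPCODES:        # PE opcodes are checked first
            log.append((mar, "PE"))
        elif opcode in CP_OPCODES:      # then the CP opcodes
            log.append((mar, "CP"))
            if opcode == CPHALT:
                return log
        else:
            raise ValueError("unknown opcode %d at address %d" % (opcode, mar))
```

Calling run on a three-instruction program such as
[(120, 16, 0, 0), (0, 0, 0, 0), (255, 0, 0, 0)] classifies the first and last
instructions as CP instructions and the no-op as a PE instruction.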

Major sections of Pascal code from the processor module
perform the initialization of processors, instruction decoding,
and processor-state dumping. The relevant procedures are named
accordingly in the Pascal source and can be studied in
Appendix A, where the entire emulator is located. A basic flow
chart of the overall processor module is included in Figure 4.4.

With each instruction fetch (Ifetch), pertinent information
is written into the trace file by the processor module if the
level of trace is greater than 1. This trace file will prove
invaluable as the complexity of the functions performed by the
emulator increases.

4.3.2  PROGRAMMING THE MJH1

Before invoking the emulator and successfully performing
various matrix operations, the user must first become familiar
with specifics of the emulator itself. Following are details of
the emulator that should aid the programmer/user in obtaining
favorable results.

There are four files that must be assigned to the MJH1
before a successful run can be made. These files are the operand
file, code file, trace file, and the result file. The first two
are input files that must be created prior to runtime, while the
latter two are output files that are generated by the emulator
itself.

4.3.2.1  OPERAND FILE

The operand file, oprndfile, is the file which contains the
two operand matrices A and B. This thesis considers matrix
multiplication exclusively. Thus, the dimensions of A must be
M by N, while matrix B must be N by P, yielding a result matrix
with the dimensions M by P. A sample operand file follows.


Matrix A (4 x 1)


10  20


Matrix B (1 x 2)


Notice how the data file (whose extension is .mtx) is written in
matrix form.


4.3.2.2  CODE FILE


The code file contains the instructions to be executed by
the emulator. As stated before, there are two types of
instructions: Processing Element (PE) instructions and Control
Processor (CP) instructions. Two instruction sets are necessary
due to the nature of the multiprocessor architecture that is
being emulated. A listing of the MJH1 instruction set is shown
in Table 4.2.

The instruction format of the emulator is shown in Figure
4.3. The programmer must always supply three operands with an
opcode. This is true even if all three fields are not being
used. In this case, zeros must be supplied in the unused
operand fields. In general, the PE opcodes contain only 2 digits,
while CP opcodes contain 3 digits. Operands 1 and 2 generally
specify the source of data, while operand 3 specifies its
destination after some operation has been performed. In the
emulator, however, required operands are coded as follows:


m  :=  local memory specifier
          1 -> MA
          2 -> MB
          3 -> MR
          4 -> MS

v  :=  an immediate value

c  :=  a count

i  :=  an index, or immediate value

IA, IB, IR  :=  indices for MA, MB, MR, respectively
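
As a hedged illustration of the four-field format, the following Python
sketch splits one code-file line into an opcode, three operands, and a
trailing comment. The names Instruction and parse_code_line are hypothetical
and are not part of the emulator; the sketch assumes only what is stated
above, namely that every line carries an opcode followed by three integer
operands, with any remaining text treated as a comment.

```python
from typing import NamedTuple


class Instruction(NamedTuple):
    """One code-file line: opcode, three operands, optional comment."""
    opcode: int
    op1: int
    op2: int
    op3: int
    comment: str


def parse_code_line(line: str) -> Instruction:
    """Split a line into four integers and whatever text follows them."""
    fields = line.split(None, 4)  # at most four splits: 4 numbers + comment
    if len(fields) < 4:
        raise ValueError("an opcode and three operands are required: %r" % line)
    opcode, op1, op2, op3 = (int(f) for f in fields[:4])
    comment = fields[4].strip() if len(fields) > 4 else ""
    return Instruction(opcode, op1, op2, op3, comment)
```

For example, the line "120 16 0 0 Clear 16 words in CM" parses to opcode 120
with operands 16, 0, and 0, and a line with fewer than four fields is rejected,
which mirrors the rule that unused operand fields must still be filled with
zeros.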

In order to enable processors, the ENABLE instruction must be
used. Operand one of this instruction is the decimal equivalent
of the binary string of processors comprising the system. A '1'
indicates that a certain processor is turned on, or enabled,
while a '0' indicates that a certain processor is turned off. In
this binary string of processors, the most significant bit
represents processing element 0 while the least significant bit
represents the highest numbered PE in the configuration. For
example, if two PE's comprise a system, then

100  2  0  0

means to enable PE 0 only, while

100  3  0  0

means to enable all processors. Remember that the binary
representation of 2 is 10, and the binary representation of 3 is
11.
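
The mask arithmetic can be checked with a short Python sketch. The function
names enable_mask and mask_to_pes are hypothetical helpers introduced here
for illustration; the only rule taken from the text is that PE 0 occupies the
most significant bit of a string whose length equals the number of PEs.

```python
def enable_mask(active_pes, num_pes):
    """Return the decimal ENABLE operand for the given PE numbers.

    PE 0 maps to the most significant of num_pes bits, and the
    highest-numbered PE maps to the least significant bit.
    """
    mask = 0
    for pe in active_pes:
        if not 0 <= pe < num_pes:
            raise ValueError("PE number %d is outside 0..%d" % (pe, num_pes - 1))
        mask |= 1 << (num_pes - 1 - pe)
    return mask


def mask_to_pes(mask, num_pes):
    """Inverse of enable_mask: list the PE numbers whose bit is set."""
    return [pe for pe in range(num_pes) if mask & (1 << (num_pes - 1 - pe))]
```

With two PEs, enable_mask([0], 2) gives 2 and enable_mask([0, 1], 2) gives 3,
in agreement with the two ENABLE examples above.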

4.3.2.3  TRACE FILE

The trace file is a file that contains a detailed account of
all activity within the emulator on each clock cycle. There are
varying levels of tracing available:

0)  no trace

1)  processor state only

2)  full processor trace

Level 0 is the lowest, simplest level of trace, while level 2 is
the most comprehensive trace level. Although the full trace is
lengthy and consumes a significant amount of space, it is a major
feature of the emulator, in that it allows one to see what
happens to every active processor during each clock cycle. If a
result is erroneous, the source of the error can easily be traced.


4.3.2.4  RESULT FILE


The result file contains the operand matrices along with the
result matrix that is created by the MJH1 after simulation.


4.4  FILE NAMING CONVENTIONS


Table 4.5 describes the way that files should be named.

TABLE 4.5

FILE NAMING CONVENTIONS

EXTENSION    DESCRIPTION

.mtx         Operand file containing two matrices A & B.
             Prefix of filename should specify dimensions
             or algorithm with which it is to be used,
             e.g., MEQKNP4, a file containing 2 matrices with
             M equal to number of processors (K), and PE = 4.

.cod         Code file that is read as input by the emulator.
             Note that the code must be in absolute form;
             no symbolic coding is allowed.

             Trace file that is created while the emulator
             is executing instructions. The trace level
             (0,1,2), which determines the volume of trace
             output, can be set at the start of execution
             by the user, or it can be set by the programmer
             using the SETT instruction.

             Result file that is created by MJH1. This file
             contains both operand matrices, followed by the
             result matrix.


4.5  CREATING THE CODE FILE


This is perhaps the most important file that must be
submitted to the MJH1. This is the file that performs the matrix
partitioning based on the partitioning algorithms of Chapter
Three. This file can be easily generated by hand, although the
programmer must be familiar with the assembly code that is
specific to the MJH1. Thus, some background in assembly language
programming, i.e. microprogramming, is very helpful in the
programming of this machine.

Before the code can be written, however, the programmer must
know several "design" parameters: the number of processors that
will be active in the configuration, and the dimensions of the
operand matrices. From this basic information, the correct
partitioning algorithm can be chosen for the most efficient
allocation of values to the active processors.

The code file contains three sections as shown in Figure
4.5. The first section involves allocating space in central
memory for three matrices: matrix A, matrix B, and matrix R.
The second section defines the partitioning of the operand
matrices. The third and final section of the code file performs
matrix multiplication (or whatever matrix operation is desired),
and writes the result matrix back to central memory. A sample
code file with comments follows in Figure 4.6. This sample is an
implementation of the partitioning algorithm for matrix
multiplication when the number of rows, m, is a multiple of the
number of active processors. This is the algorithm numbered 3b,
which can be referenced in Chapter Three, Section Three under the
same label, 3b. Note that the first instruction is a no-op whose
comment statement describes the way in which each matrix is
partitioned and distributed among the active processors.
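
The row-wise distribution named in that comment, (m/k) rows of A and the
entire B for each processor, can be sketched in Python as follows. This is a
sequential illustration under stated assumptions, not the Chapter Three
algorithm itself: the function names partition_rows and multiply_partitioned
are hypothetical, rows are numbered from 0, and each pass of the outer loop
merely stands in for the work one PE would do in parallel.

```python
def partition_rows(m, k):
    """Give each of k PEs a contiguous block of m/k row indices (0-based)."""
    if k <= 0 or m % k != 0:
        raise ValueError("m must be a positive multiple of the number of PEs")
    rows_per_pe = m // k
    return [range(pe * rows_per_pe, (pe + 1) * rows_per_pe) for pe in range(k)]


def multiply_partitioned(a, b, k):
    """Multiply a (m x n) by b (n x p), one block of rows per simulated PE."""
    m, n, p = len(a), len(b), len(b[0])
    result = [None] * m
    for rows in partition_rows(m, k):      # one iteration per PE
        for i in rows:                     # rows of A held by this PE
            # Every PE holds all of B, so each row of the result is a set
            # of inner products of one row of A with the columns of B.
            result[i] = [sum(a[i][j] * b[j][c] for j in range(n))
                         for c in range(p)]
    return result
```

For a 4 x 1 matrix A and a 1 x 2 matrix B with two PEs, partition_rows(4, 2)
assigns rows 0-1 to PE 0 and rows 2-3 to PE 1, so PE 1 starts at the third row
of A.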



FIGURE 4.6

SAMPLE CODE FILE

ALG3B: 4 x 1 BY 1 x 2; NP = 2. [IMULTK]

  0    0    0    0    (m/k) rows of [A] and entire [B]
120   16    0    0    Clear 16 words in CM
110    0    4    1    Read matrix A                        [MEMORY
111    0    4    1    Echo A                                ALLOCATION]
110    4    1    2    Read matrix B and echo
111    4    1    2    Echo B
  0    0    0    0    Start partitioning
100    3    0    0    Enable ALL PE's to load B
 15    4    0    0    PEs at B[1,1]
 12    2    2    0    Allocate space in MB for 2 values
 20    2    2    0    Move 2 values into MB
100    2    0    0    Enable PE 0
 15    0    0    0    Set address to A[1,1]
100    1    0    0    Enable PE 1                          [PARTITIONING]
 15    2    0    0    Set address to A[3,1]
100    3    0    0    Enable ALL PEs
103    3    0    0    Start TRACE
 12    1    2    0    Allocate space for A
 20    1    2    0    Load various rows into MA
  0    0    0    0    Prepare to multiply
 11    1    0    0    Set MA to 0
 11    2    0    0    Set MB to 0
 11    3    0    0    Set MR to 0
 12    3    4    0    Allocate room in MR for 2 values
 65    0    1    0    Multiply
 11    1    0    0    Reset B
 11    2    1    0
 65    1    1    0
 11    1    1    0
 11    2    0    0
 65    2    1    0
 11    1    1    0
 11    2    1    0
 65    3    1    0                                         [MATRIX
100    2    0    0    Enable P0                             COMPUTATION]
 15    6    0    0    Set MAR to 1st result
100    1    0    0    Enable P1
 15   10    0    0    Set MAR to 2nd result
100    3    0    0    Enable ALL PEs
 11    3    0    0
 21    4    0    0    Store 1st results
111    6    4    2    PRINT
103    0    0    0    STOP TRACE
255    0    0    0    CPHALT

Note that special care must be taken when writing the result
matrix back to the central memory. The MJH1 architecture is
defined such that data is read and stored in a linear,
one-dimensional array. In other words, each data word is stored
in sequential memory locations. Also, matrices are read in row
order. If matrix A contained two rows and three columns, it
would be read into central memory by rows. The first element of
row 1 would be followed by the second element of row 1, etc.,
until the third element of row 2 was read (matrix A has the
dimensions of 2 by 3). Thus, after the operations have been
performed by all active PE's, the programmer must be aware of
which processor computed which result, and then make sure that
the result is written back to the proper location in central
memory.
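
The address bookkeeping can be illustrated with a small Python sketch. The
helper names cm_address and result_base are hypothetical, indices are
0-based, and the sketch assumes that the result matrix occupies a contiguous
row-major block starting at a known base address and that each PE computes
m/k consecutive rows.

```python
def cm_address(base, row, col, ncols):
    """Central-memory address of element (row, col), 0-based, row-major."""
    return base + row * ncols + col


def result_base(r_base, pe, m, k, p):
    """First central-memory address of the result rows computed by one PE.

    r_base is the address of R[0][0]; each of the k PEs owns m // k
    consecutive rows of the m x p result matrix.
    """
    rows_per_pe = m // k
    return cm_address(r_base, pe * rows_per_pe, 0, p)
```

In the sample code file of Figure 4.6, the result matrix is 4 x 2 and is
written from address 6, so result_base(6, 0, 4, 2, 2) is 6 and
result_base(6, 1, 4, 2, 2) is 10, the same two addresses that the sample
loads into the MAR of PE 0 and PE 1 before storing results.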

After the code file has been created in accordance with the
particular algorithm, the MJH1 may be invoked, thereby testing
the correctness of the partitioning algorithm, and of the code
file created by the programmer. Remember that the burden of
programming to exploit the inherent parallelism in such an
environment is placed solely on the programmer. Thus, the
correctness of the code file determines the correctness of the
simulation.

4.6  RUNNING THE MJH1

The MJH1 emulator is very easy to operate. The
user/programmer must first have access to an account on the VAX
11/780. In order to obtain an account, see the system operator.
After one has an account and has successfully logged onto the
system, the emulator must be made accessible to the user's
account. This can be done by setting the default to the
residence of the emulator as follows:

SET DEF [HUGHES.EMULATOR]

This command allows the programmer access to the MJH1, and to
previously created files relative to the operation of the
emulator. The user should feel free to explore and examine the
existing files by typing them out onto the terminal. This will
further familiarize the user with the machine and its operation.
The command RUN MJH1EM invokes the emulator. Note that all
files being run on or sent to the emulator should reside in
[HUGHES.EMULATOR]. Therefore, the user must either create both
the operand file and code file in that directory, or must copy
them over to [HUGHES.EMULATOR] before invoking the emulator.

The emulator will prompt the user for the various filenames,
the number of active processors, and the level of tracing as
follows:

ENTER NAME OF OPERAND FILE :

ENTER NAME OF CODE FILE :

ENTER NAME OF RESULT FILE :

ENTER NAME OF TRACE FILE :

ENTER NUMBER OF PROCESSORS :

ENTER PROCESSING TRACING LEVEL

0)  NO TRACE

1)  PROCESSOR STATE ONLY

2)  FULL PROCESSOR TRACE

After these parameters have been entered, the emulator goes to
work and executes each instruction. In other words, it does
exactly what it is programmed to do. When the CPHALT command has
been executed, the emulator stops, and the message

MJH1 SHUT-DOWN !!

is displayed on the screen. To see what has just transpired, the
result file and the trace file should be either typed or printed.
Upon examination of these files, one will be able to tell whether
the run was successful or not.

A demonstration of the MJH1 Emulator is found in Appendix C.
Algorithm 3B of Chapter Three is implemented here, the same
algorithm that is used in Figure 4.6. This demonstration
includes a sample operand file, code file, result file, and trace
file.

CHAPTER FIVE
SUMMARY AND SUGGESTIONS


5.0  SUMMARY


In summation, the MJH1 is a very good software tool to use in
the analysis of the correctness of various partitioning
algorithms. It has the potential to be a forerunner in the
discipline of Computer Engineering at North Carolina A & T State
University. This is possible because the emulator is easily
adaptable to fit system architectures other than SIMD. Through
simulation, different architectures can be implemented, as well
as various algorithms for exclusive partitioning of matrices to
perform simple matrix operations in a parallel or multiprocessor
environment.

The MJH1 Emulator has the potential to be expanded
tremendously. The instruction set could grow immensely. Also,
the emulator could be made into a smarter, more intelligent
machine by combining some existing commands and making its
language of a higher level.

5.1  SUGGESTIONS 

Suggestions for future work include devising a way to
generate the code necessary for the operation of the MJH1 in a
systematic, automatic fashion. Upon careful examination of all
existing code files, one can see the existence of a pattern. Due
to time constraints, the author did not have time to decipher the
pattern; however, she acknowledges the fact that a pattern does
exist. Thus, a code generator should be looked into in order to
make the MJH1 a more attractive entity. The programmer can see,
by examining some of the existing codes, that programming the
MJH1 by hand can be excruciatingly long and frustrating.

Also, the algorithms for matrix inversion should be looked
into and possibly implemented on the emulator. After these two
feats have been successfully accomplished, the author feels that
the emulator should be expanded to be able to handle more PE's.
Currently, the maximum number of PEs in a system configuration is
only eight. However, before adding more PEs to the
configuration, the controller should be expanded to include error
detection and/or error correction.

APPENDIX A


PROGRAM LISTING OF THE MJH1 EMULATOR


program MJH1(input, output, codefile, oprndfile, tracefile, resultfile);

{ INCLUDE MJH1.VAR }

(* Declaration of MJ-1 Arithmetic Processor State Register.  *)
(*                                                           *)
(* Global Constants: These define the actual configuration   *)
(*                                                           *)

const

    MAXNP      = 8;      { Max Number of arithmetic processors.   }
    NPM1       = 7;      { Number of processors minus one.        }
    MEMMAX     = 256;    { Size of each AP local memory.          }
    CPMEMMAX   = 2048;   { Size of central memory (CM).           }
    SHREGMAX   = 1;      { Size of shift-register memory.         }
    MAXDATAVAL = 25600;  { Hypothetical overflow value.           }

    zeroval    = 0;      { Constant for integer data type.        }
    trueval    = '1';    { Char representation of logical "true". }
    falseval   = '0';    { Char repr'n of logical "false".        }

(* PE INSTRUCTION SET MNEMONICS AND OPCODES. *)

    { MISCELLANEOUS INSTRUCTIONS. }
    NOOP   = 0;    { No Op: Allows documentation of code.   }

    { ADDRESSING INSTRUCTIONS. }
    ADBASE = 10;   { Advance LM base register.              }
    SETB   = 11;   { Set LM base register to vector origin. }
    ALLOC  = 12;   { Allocate c more words in LM.           }
    SETMA  = 15;   { Set MAR to address in CM.              }
    AEMA   = 16;   { Advance MAR by specified value.        }

    { CM ACCESS and SHIFT INSTRUCTIONS. }
    LOADCM = 20;   { Load into LM from CM.                  }
    STORCM = 21;   { Store into CM from LM.                 }
    FCSHF  = 22;   { Forward circular shift.                }
    BCSHF  = 23;   { Backward circular shift.               }
    PSSHF  = 24;   { Perfect shuffle shift.                 }
    SHFIN  = 25;   { Receive shifted data.                  }
    SHIFX  = 26;   { Put data on shift port.                }

    { DATA MOVEMENT INSTRUCTIONS. }
    MOVA   = 30;   { Move data to MA from another LM.       }
    MOVB   = 31;   { Move data to MB from another LM.       }
    MOVR   = 32;   { Move data to MR from another LM.       }
    MOVS   = 33;   { Move data to MS from another LM.       }

    { SCALAR ARITHMETIC INSTRUCTIONS. }
    SADD   = 40;   { Add scalar values in MA and MB.        }
    SSUB   = 41;   { Subtract scalar values in MA and MB.   }
    SMPY   = 42;   { Multiply scalar values in MA and MB.   }
    SDIV   = 43;   { Divide scalar values in MA and MB.     }
    SMSA   = 44;   { Add MS[i] to a LM.                     }

    { VECTOR-SCALAR ARITHMETIC INSTRUCTIONS. }
    { NOTE: Vector in MA, Scalar in MB. }
    VSADD  = 50;   { Add scalar to vector.                  }
    VSSUB  = 51;   { Subtract scalar from vector.           }
    VSMPY  = 52;   { Multiply vector by scalar.             }
    VSDIV  = 53;   { Divide vector by scalar.               }

    { VECTOR-VECTOR ARITHMETIC INSTRUCTIONS. }
    PVADD  = 60;   { Pair-wise vector addition.             }
    PVSUB  = 61;   { Pair-wise vector subtraction.          }
    PVMPY  = 62;   { Pair-wise vector multiplication.       }
    INPRD  = 65;   { Vector inner product.                  }

    { STATUS-CHECKING INSTRUCTIONS. }
    TST    = 70;   { Test statusword for error condition.   }


(* CP INSTRUCTION SET MNEMONICS AND OPCODES. *)

    { CONTROL INSTRUCTIONS. }
    ENABLE = 100;  { Enable processors specified by mask.   }
    PUSHM  = 101;  { Push current active mask onto MSTACK.  }
    PULLM  = 102;  { Restore (pull) mask from MSTACK.       }
    SETT   = 103;  { Set MJH1 TRACE level.                  }
    CPHALT = 255;  { Shut down Control Processor.           }

    { I/O INSTRUCTIONS. }
    MREAD  = 110;  { Read M x N matrix into CM.             }
    MREADB = 113;  { Read M x N matrix [B] into CM.         }
    MWRITE = 111;  { Write M x N matrix stored in CM.       }

    { DATA MANAGEMENT INSTRUCTIONS. }
    SETV   = 120;  { Set range of CM locations to value.    }


type

    dataval    = integer;                 { Data Type of array    }
    matrixdim  = 0..31;                   { Max rows/cols         }
    matrix     = array[matrixdim, matrixdim]
                 of dataval;              { data values.          }

    PEmask     = record                   { Bit string indicating }
        bit: array[0..NPM1]               { active PEs.           }
             of boolean;
    end;

    maskstack  = array[1..10]             { Stack of active PE masks.  }
                 of PEmask;

    maskstr    = array[0..NPM1]           { String version of PE mask. }
                 of char;

    statusword = array[0..3]              { 4-bit Status word.    }
                 of boolean;              { See Below.            }
    { bit zero: 1 = enabled/noncompletion.          }
    { bits 1-2: 00 = no exception.                  }
    {           01 = arithmetic exception:          }
    {                bit 3: 0 = zero divide;        }
    {           10 = machine exception.             }
    {           11 = operand addressing exception.  }
    {                bit 3: 0 = address range;      }

    statustr   = array[0..3]              { String version of status word. }
                 of char;

    datamem    = array[0..MEMMAX]         { local A,B,R memories.  }
                 of dataval;

    shiftmem   = array[0..SHREGMAX]       { Shift register memory. }
                 of dataval;

    cpmem      = array[0..CPMEMMAX]       { The Central Memory.    }
                 of dataval;

    instruction = record
        opcode:integer;      { Operation code.               }
        op1:integer;         { First operand.                }
        op2:integer;         { Second operand.               }
        op3:integer;         { Third operand.                }
    end;

    opcodeset  = set of 0..CPHALT;        { Valid processor opcodes. }

    processor  = record
        PROCID:integer;      { Processor ID.                 }
        ACC:dataval;         { Accumulator.                  }
        MAR:integer;         { Memory address reg.           }

                             { Processor ID for shifting.    }
        FCSID:integer;       { Forward circular.             }
        BCSID:integer;       { Backward circular.            }
        PSSID:integer;       { Perfect Shuffle.              }

                             { Local Memories.               }
        MA:datamem;          { A-operand Memory.             }
        MB:datamem;          { B-operand Memory.             }
        MR:datamem;          { Result Memory.                }
        MS:shiftmem;         { Shift registers.              }
        SPORT:shiftmem;      { Inter-processor shift port.   }

                             { Base Registers for Local Mem's. }
        MAB:integer;         { Base register for MA.         }
        MBB:integer;         { Base register for MB.         }
        MRB:integer;         { Base register for MR.         }

                             { Current Bounds-Reg for LM's.  }
        MAH:integer;         { Hi in-use MA address.         }
        MBH:integer;         { Hi in-use MB address.         }
        MRH:integer;         { Hi in-use MR address.         }

        STATUS:statusword;   { condition code bits.          }
    end;  { processor record. }


{ INCLUDE MJH1.PEC }

(* REQUIRED VARIABLES *)

var

    codefile:text;          { File containing MJH1 code along  }
                            { with immediate data.             }
    oprndfile:text;         { File containing matrix values.   }
    tracefile:text;         { File to contain full processor   }
                            { execution trace.                 }
    resultfile:text;        { File to contain "result" (i.e.,  }
                            { data transmitted by "SEND"       }
                            { instructions).                   }
    TRACE:integer;          { Trace enable flag.               }

    IR:instruction;         { Broadcasted instr. register.     }
    PE:array[0..NPM1]       { The MJH1 network of PE's.        }
       of processor;
    NP:integer;             { Number of processors in use.     }
    NAP:integer;            { Number of currently active PEs.  }
    A,B,R:matrix;           { Operand and result matrices.     }
    CM:cpmem;               { The CP Central Memory.           }
    CMHI:integer;           { Highest in-use CM address.       }
    ACTIVEMASK:PEmask;      { Current active processor mask.   }
    MSTACK:maskstack;       { Stack of active PE masks.        }
    MSTKTOP:integer;        { Stack pointer for MSTACK.        }
    COMPLETED:boolean;      { Flag set when active PEs         }
                            { complete instruction.            }
    PEOPCODES:opcodeset;    { Set of valid PE opcodes.         }
    CPOPCODES:opcodeset;    { Set of valid CP opcodes.         }
    i:integer;              { Work variable.                   }
    P:integer;


(* Processor Module: Contains code for executing the instruction *)
(* set, processor initialization, processor-state dumping.       *)


(* Initialize processor registers. *)
procedure initprocessor(var PE:processor; pid,fcs,bcs,pss:integer);

var i:integer;   { Work variable. }

begin  { initprocessor }
    with PE do
    begin
        PROCID := pid;  ACC := zeroval;  MAR := -1;
        FCSID := fcs;   BCSID := bcs;    PSSID := pss;
        MAB := 0;  MBB := 0;  MRB := 0;
        MAH := 0;  MBH := 0;  MRH := 0;

        for i := 0 to MEMMAX-1 do
        begin
            MA[i] := zeroval;  MB[i] := zeroval;  MR[i] := zeroval
        end;

        for i := 0 to SHREGMAX-1 do
        begin
            MS[i] := zeroval;  SPORT[i] := zeroval;
        end;

        for i := 0 to 3 do STATUS[i] := false;
    end;  { with PE }
end;  { initprocessor }


(* Convert 4-bit status vector into character array. *)
procedure convertstatus2str(var S:statusword; var C:statustr);

var i:integer;   { Work variable. }

begin
    for i := 0 to 3 do
        if S[i]
            then C[i] := trueval
            else C[i] := falseval;
end;  { convertstatus2str }


(* Convert processor mask into character array. *)
procedure convertmask2str(NP:integer; var MASK:PEmask;
                          var MSTR:maskstr);

var i:integer;   { Work variable. }

begin
    for i := 0 to NP-1 do
        if MASK.bit[i]
            then MSTR[i] := trueval
            else MSTR[i] := falseval;
end;  { convertmask2str }


(* Dump the Control Processor State. *)
procedure DumpCPState(NP,NAP,TRACE:integer; var ACTIVEMASK:PEmask);

var
    i:integer;       { Work variable.                     }
    mstr:maskstr;    { String version of ACTIVEMASK.      }
    nvals:integer;   { vals on current line: skip when 12. }

begin
    convertmask2str(NP,ACTIVEMASK,mstr);
    writeln(tracefile);
    write(tracefile,'CP STATE --> ','TRACE =',TRACE:2,
          '  NP =',NP:3,'  NAP =',NAP:3,
          '  ACTIVEMASK = ');
    for i := 0 to NP-1 do
        write(tracefile,mstr[i]:2);
    writeln(tracefile);

    if TRACE > 0 then
    begin
        nvals := 0;
        write(tracefile,' DUMP OF CM: ',
              CMHI:4,' WORDS IN USE.');
        writeln(tracefile);
        for i := 0 to CMHI do
        begin
            if nvals = 12 then
            begin
                writeln(tracefile,' ':8);
                nvals := 0;
            end;
            write(tracefile,CM[i]:6);
            nvals := nvals + 1;
        end;
    end;
end;  { DumpCPState }

(* Dump the processor state record. *)
procedure dumprocstate(var PE:processor);

var ST:statustr;   { Work array for displaying Status bits. }

begin  { dumprocstate }
    with PE do
    begin
        convertstatus2str(STATUS,ST);
        writeln(tracefile);
        writeln(tracefile,' ID ACC MAR FC BC PS MS SP',
                ' MAB MBB MRB MAH MBH MRH STATUS ');
        writeln(tracefile,PROCID:3,ACC:4,MAR:4,
                FCSID:3,BCSID:3,PSSID:3,
                MS[0]:4,SPORT[0]:4,
                MAB:4,MBB:4,MRB:4,MAH:4,MBH:4,MRH:4,
                ST[0]:3,ST[1]:2,ST[2]:2,ST[3]:2);
        writeln(tracefile);
    end;  { with PE }
end;  { dumprocstate }

(* Dump the specified local memory *)
procedure dumplocalmem(MEMTYPE:char; var MEM:datamem;
                       MEMHI:integer);

var i:integer;  { Work variable. }

begin { dumplocalmem }
  writeln(tracefile);
  case MEMTYPE of
    'A': write(tracefile,'DUMP OF MA: ');
    'B': write(tracefile,'DUMP OF MB: ');
    'R': write(tracefile,'DUMP OF MR: ');
  end; { case of MEMTYPE }
  writeln(tracefile,MEMHI:4,' WORDS IN USE.');
  writeln(tracefile);
  for i := 0 to MEMHI-1 do
    write(tracefile,MEM[i]:4);
end; { dumplocalmem }


(* Dump processor state and memory. *)
procedure dumprocessor(var PE:processor; TRACE:integer);
begin
  if TRACE >= 1
    then dumprocstate(PE);
  if TRACE >= 2 then
    with PE do
    begin
      dumplocalmem('A',MA,MAH);
      dumplocalmem('B',MB,MBH);
      dumplocalmem('R',MR,MRH);
    end;
end; { dumprocessor }

(* Shift value V into I'th position in processor number P *)
procedure shift(P,I:integer; V:dataval);
begin
  with PE[P] do
    SPORT[I] := V;
end; { shift }


(* Fetch the next instruction from the Instruction Stream. *)
procedure Ifetch(var IR:instruction);
begin
  writeln(output);
  write(output,' IR> ');
  with IR do
  begin
    readln(codefile,opcode,op1,op2,op3);
    if TRACE >= 0 then
    begin
      writeln(output,opcode:2,op1:3,op2:3,op3:3);
      writeln(tracefile);
      writeln(tracefile,' IR> ',
              opcode:2,op1:3,op2:3,op3:3);
    end;
  end;
end; { Ifetch }


(* Execute the current instruction on processor PE. *)
procedure PEexecute(var IR:instruction; var PE:processor);

var
  i:integer;                    { Local work variable. }
  error:boolean;                { Error flag. }

  (* Move C data values starting at SM[SB] to TM[TB]. *)
  procedure movedata(C:integer;
                     var SM:datamem; SB:integer;
                     var TM:datamem; TB:integer);
  var i:integer;  { Work variable. }
  begin { movedata }
    for i := 0 to C-1 do
      TM[TB+i] := SM[SB+i];
  end; { movedata }

  (* Move C data values from MS into TM starting at TB. *)
  procedure movefromMS(C:integer; var MS:shiftmem;
                       var TM:datamem; TB:integer);
  var i:integer;
  begin { movefromMS }
    for i := 0 to C-1 do
      TM[TB+i] := MS[i];
  end; { movefromMS }

  (* Move C data values into MS from SM, starting at SB. *)
  procedure movetoMS(C:integer; var MS:shiftmem;
                     var SM:datamem; SB:integer);
  var i:integer;
  begin
    for i := 0 to C-1 do
      MS[i] := SM[SB+i];
  end; { movetoMS }


begin { PEexecute }
  error := false;
  WITH PE do
  BEGIN
    STATUS[1] := false; STATUS[2] := false; STATUS[3] := false;
    case IR.opcode of

      00: { NOOP }
        begin
        end;

      10: { ADBASE m,i }
        begin
          if IR.op1 = 1 then MAB := MAB + IR.op2
          else if IR.op1 = 2 then MBB := MBB + IR.op2
          else if IR.op1 = 3 then MRB := MRB + IR.op2;
        end;

      11: { SETB m,v }
        begin
          if IR.op1 = 1 then MAB := IR.op2
          else if IR.op1 = 2 then MBB := IR.op2
          else if IR.op1 = 3 then MRB := IR.op2;
        end;

      12: { ALLOC m,c }
        begin
          if IR.op1 = 1 then
            MAH := MAH + IR.op2
          else if IR.op1 = 2 then
            MBH := MBH + IR.op2
          else if IR.op1 = 3 then
            MRH := MRH + IR.op2;
        end;

      15: { SETMA v }
        begin
          MAR := IR.op1;
        end;

      16: { ADMA v }
        begin
          MAR := MAR + IR.op1;
        end;

      20: { LOADCM m,c }
        begin
          if IR.op1 = 1 then
            for i := 0 to IR.op2-1 do
              MA[MAB+i] := CM[MAR+i]
          else if IR.op1 = 2 then
            for i := 0 to IR.op2-1 do
              MB[MBB+i] := CM[MAR+i];
        end;

      21: { STORCM c }
        begin
          for i := 0 to IR.op1-1 do
            CM[MAR+i] := MR[MRB+i];
        end;

      22: { FCSHF c }
        begin
          for i := 0 to IR.op1-1 do
            shift(FCSID,i,MS[i]);
        end;

      23: { BCSHF c }
        begin
          for i := 0 to IR.op1-1 do
            shift(BCSID,i,MS[i]);
        end;

      24: { PSSHF c }
        begin
          for i := 0 to IR.op1-1 do
            shift(PSSID,i,MS[i]);
        end;

      25: { SHFIN c }
        begin
          for i := 0 to IR.op1-1 do
            MS[i] := SPORT[i];
        end;

      26: { SHIFX c,p }
        begin
          p := IR.op2;
          for i := 0 to IR.op1-1 do
            shift(p,i,MS[i]);
        end;

      30: { MOVA m,c }
        begin
          if IR.op1 = 2 then movedata(IR.op2,MB,MBB,MA,MAB)
          else if IR.op1 = 3 then
            movedata(IR.op2,MR,MRB,MA,MAB)
          else if IR.op1 = 4 then
            movefromMS(IR.op2,MS,MA,MAB);
        end;

      31: { MOVB m,c }
        begin
          if IR.op1 = 1 then movedata(IR.op2,MA,MAB,MB,MBB)
          else if IR.op1 = 3 then
            movedata(IR.op2,MR,MRB,MB,MBB)
          else if IR.op1 = 4 then
            movefromMS(IR.op2,MS,MB,MBB);
        end;

      32: { MOVR m,c }
        begin
          if IR.op1 = 1 then movedata(IR.op2,MA,MAB,MR,MRB)
          else if IR.op1 = 2 then
            movedata(IR.op2,MB,MBB,MR,MRB)
          else if IR.op1 = 4 then
            movefromMS(IR.op2,MS,MR,MRB);
        end;

      33: { MOVS m,c }
        begin
          if IR.op1 = 1 then movetoMS(IR.op2,MS,MA,MAB)
          else if IR.op1 = 2 then movetoMS(IR.op2,MS,MB,MBB)
          else if IR.op1 = 3 then
            movetoMS(IR.op2,MS,MR,MRB);
        end;

      40: { SADD IA,IB,IR }
        begin
          MR[MRB+IR.op3] := MA[MAB+IR.op1] + MB[MBB+IR.op2];
        end;

      41: { SSUB IA,IB,IR }
        begin
          MR[MRB+IR.op3] := MA[MAB+IR.op1] - MB[MBB+IR.op2];
        end;

      42: { SMPY IA,IB,IR }
        begin
          MR[MRB+IR.op3] := MA[MAB+IR.op1] * MB[MBB+IR.op2]
        end;

      43: { SDIV IA,IB,IR }
        begin
          if MB[MBB+IR.op2] <> zeroval
            then MR[MRB+IR.op3] := MA[MAB+IR.op1]
                                   DIV MB[MBB+IR.op2]
            else begin
              error := true; STATUS[1] := true;
              writeln(output,'ZERO DIVIDE IN PE=',PROCID:2);
            end;
        end;

      44: { SMSA }
        begin
          { As in ADBASE and SETB, op1 = 1, 2, 3 is taken to select MA, MB, MR. }
          { Note: i is not set by this instruction before MS[i] is read. }
          if IR.op1 = 1 then
            MR[MRB + IR.op3] := MA[MAB + IR.op2] + MS[i]
          else if IR.op1 = 2 then
            MR[MRB + IR.op3] := MB[MBB + IR.op2] + MS[i]
          else if IR.op1 = 3 then
            MR[MRB + IR.op3] := MR[MRB + IR.op2] + MS[i]
        end;

      50: { VSADD i,c }
        begin
          for i := 0 to IR.op2-1 do
            MR[MRB+i] := MA[MAB+i] + MB[MBB+IR.op1];
        end;

      51: { VSSUB i,c }
        begin
          for i := 0 to IR.op2-1 do
            MR[MRB+i] := MA[MAB+i] - MB[MBB+IR.op1];
        end;

      52: { VSMPY i,c }
        begin
          for i := 0 to IR.op2-1 do
            MR[MRB+i] := MA[MAB+i] * MB[MBB+IR.op1];
        end;

      53: { VSDIV i,c }
        begin
          if MB[MBB+IR.op1] <> zeroval then
            for i := 0 to IR.op2-1 do
              MR[MRB+i] := MA[MAB+i] DIV MB[MBB+IR.op1]
          else begin
            error := true; STATUS[1] := true;
            writeln(output,'ZERO DIVIDE IN PE[',
                    PROCID:1,']');
          end;
        end;

      60: { VPADD c }
        begin
          for i := 0 to IR.op1-1 do
            MR[MRB+i] := MA[MAB+i] + MB[MBB+i];
        end;

      61: { VPSUB c }
        begin
          for i := 0 to IR.op1-1 do
            MR[MRB+i] := MA[MAB+i] - MB[MBB+i];
        end;

      62: { VPMPY c }
        begin
          for i := 0 to IR.op1-1 do
            MR[MRB+i] := MA[MAB+i] * MB[MBB+i];
        end;

      65: { INPRD k,c }
        begin
          ACC := 0;
          for i := 0 to IR.op2-1 do
            ACC := ACC + MA[MAB+i] * MB[MBB+i];
          MR[MRB+IR.op1] := ACC;
        end;

      70: { TST }
        begin
          STATUS[0] := STATUS[1] OR STATUS[2] OR STATUS[3];
        end;

    end; { case }

    { If no error occurred, clear the enable bit to indicate }
    { completion of the instruction execution.               }
    if NOT error then STATUS[0] := false;

  END; { WITH PE }
end; { PEexecute }


(* Read M x N matrix into CM, starting at BASE address. *)
procedure readmatrix(BASE,M,N:integer; var CM:cpmem);

var i,j,k:integer;  { Work variables. }

begin { readmatrix }
  k := BASE;
  for i := 1 to M do
    for j := 1 to N do
    begin
      read(oprndfile,CM[k]);
      A[i,j] := CM[k];
      k := k + 1;
    end;
end; { readmatrix }

procedure Breadmatrix(BASE,M,N:integer; var CM:cpmem);

var i,j,k:integer;  { Work variables. }

begin { Breadmatrix }
  k := BASE;
  for i := 1 to M do
    for j := 1 to N do
    begin
      read(oprndfile,CM[k]);
      B[i,j] := CM[k];
      k := k + 1;
    end;
end; { Breadmatrix }


(* Write M x N matrix from CM, starting at BASE address. *)
procedure writematrix(BASE,M,N:integer; var CM:cpmem);

var i,j,k:integer;  { Work variables. }

begin { writematrix }
  writeln(resultfile); writeln(resultfile);
  k := BASE;
  for i := 1 to M do
  begin
    writeln(resultfile);
    for j := 1 to N do
    begin
      A[i,j] := CM[k];
      write(resultfile,A[i,j]);
      k := k + 1;
    end;
  end;
  writeln(resultfile); writeln(resultfile);
end; { writematrix }

(* Convert integer-coded mask into boolean-string mask. *)
procedure convertnum2mask(NP,NUM:integer; var MASK:PEmask);

var i:integer;      { Work variable. }
    q,r:integer;    { quotient, remainder for conversion algm. }
    v:integer;      { Work variable: successive quotients. }

begin
  v := NUM;
  for i := NP-1 downto 0 do
  begin
    r := v MOD 2;
    q := v DIV 2;
    if r = 0
      then MASK.bit[i] := false
      else MASK.bit[i] := true;
    v := q;
  end;
end; { convertnum2mask }


(* Execute CP Instruction. *)
procedure CPExecute(var IR:instruction);

var i:integer;  { Work variable. }

begin { CPExecute }
  case IR.opcode of

    100: { ENABLE maskcode }
      begin
        convertnum2mask(NP,IR.op1,ACTIVEMASK)
      end;

    101: { PUSHM }
      begin
        MSTKTOP := MSTKTOP + 1;
        MSTACK[MSTKTOP] := ACTIVEMASK;
      end;

    102: { PULLM }
      begin
        ACTIVEMASK := MSTACK[MSTKTOP];
        MSTKTOP := MSTKTOP - 1;
      end;

    103: { SETT tracelevel }
      begin
        TRACE := IR.op1;
      end;

    110: { MREAD base,m,n }
      begin
        readmatrix(IR.op1,IR.op2,IR.op3,CM);
      end;

    113: { MREADB base,m,n }
      begin
        Breadmatrix(IR.op1,IR.op2,IR.op3,CM);
      end;

    111: { MWRITE base,m,n }
      begin
        writematrix(IR.op1,IR.op2,IR.op3,CM);
      end;

    120: { SETV base,count,val }
      begin
        if IR.op1+IR.op2 > CMHI
          then CMHI := IR.op1 + IR.op2;
        for i := 0 to IR.op2-1 do
          CM[IR.op1+i] := IR.op3;
      end;

  end; { case }


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EMULATION  TOOL  EOS  SIMULATING  MATRIX  OPERATIONS  ON 
SIMO  (SINGLE  INS  (U)  NORTH  CAROLINA  AGRICULTURAL 
TECHNICAL  STATE  UNIV  GREENSBO  H  L  MARTIN  OCT  87 
-22228  2-MA-H  DAAG29-84-G-0087  F/G  12/3  NL 


end; { CPExecute }


(* Perform Emulator initialization. *)
procedure MJH1Initialize;
type chrlen = 1..12;
var
  titleline:packed array[1..72]   { Computation title. }
            of char;
  filenm:packed array[chrlen] of char;
begin
  writeln(output,'MJH1 START-UP!!');
  write(output,' ENTER NAME OF OPERAND FILE:');
  readln(input,filenm);
  open(oprndfile,filenm,old);
  write(output,' ENTER NAME OF CODE FILE :');
  readln(input,filenm);
  open(codefile,filenm,old);
  write(output,' ENTER NAME OF RESULT FILE:');
  readln(input,filenm);
  open(resultfile,filenm,new);
  write(output,' ENTER NAME OF TRACE FILE :');
  readln(input,filenm);
  open(tracefile,filenm,new);

  { Open files and copy codefile title line to all output }
  { files.                                                 }
  reset(codefile); reset(oprndfile);
  rewrite(tracefile); rewrite(resultfile);
  readln(codefile,titleline);
  writeln(tracefile,titleline); writeln(tracefile);
  writeln(resultfile,titleline); writeln(resultfile);
  writeln(output);
  write(output,' ENTER NUMBER OF PROCESSORS: ');
  readln(input,NP);
  writeln(output);
  writeln(' ENTER PROCESSOR TRACING LEVEL: ');
  writeln(' 0) NO TRACE.');
  writeln(' 1) PROCESSOR STATE ONLY');
  writeln(' 2) FULL PROCESSOR TRACE');
  readln(input,TRACE);

  for i := 0 to NP-1 do
  begin
    initprocessor(PE[i],i,(i+1) mod NP,(NP+i-1) mod NP,
                  (2 * i) mod NP + (2 * i) div NP);
    if TRACE > 0
      then dumprocessor(PE[i],TRACE);
  end;

  { Set up opcodes for PEs and CP. }
  PEOPCODES := [NOOP,ADBASE,SETB,ALLOC,SETMA,ADMA,LOADCM,
                STORCM,FCSHF,BCSHF,PSSHF,SHFIN,MOVA,MOVB,MOVR,
                MOVS,SADD,SSUB,SMPY,SDIV,
                VSADD,VSSUB,VSMPY,VSDIV,PVADD,PVSUB,PVMPY,
                INPRD,TST];
  CPOPCODES := [ENABLE,PUSHM,PULLM,MREAD,MWRITE,SETT,SETV,
                CPHALT];

  { Initialize active processor mask, and MSTACK. }
  for i := 0 to NP-1 do
    ACTIVEMASK.bit[i] := true;
  NAP := NP;
  MSTKTOP := 0;
  CMHI := 0;
end; { MJH1Initialize }


(* Perform MJH1 shut-down. *)
procedure MJH1ShutDown;

var i:integer;  { Work variable. }

begin
  { Dump all processor states and local memories. }
  for i := 0 to NP-1 do
    dumprocessor(PE[i],TRACE+2);
  close(codefile); close(tracefile); close(resultfile);
  writeln(output,'MJH1 SHUT-DOWN!!');
end; { MJH1ShutDown }


(* Determine whether all processors completed last PE
   instruction. *)
procedure checkcompletion(NP:integer; var CURMASK:PEmask;
                          var COMPLETED:boolean);

var i:integer;  { Work variable. }

begin
  COMPLETED := true;
  for i := 0 to NP-1 do
    COMPLETED := COMPLETED AND
                 (CURMASK.bit[i] AND NOT
                  PE[i].STATUS[0]);
end; { checkcompletion }


(* Update the active mask based on PE completion codes. *)
procedure updatemask(NP:integer; var ACTIVEMASK:PEmask);
var i:integer;  { Work variable. }
begin
  for i := 0 to NP-1 do
    if ACTIVEMASK.bit[i]
      then ACTIVEMASK.bit[i] := NOT PE[i].STATUS[0];
end; { updatemask }


(* Enable processors based on mask. *)
procedure PEEnable(NP:integer; var CURMASK:PEmask;
                   var NAP:integer);

var i:integer;  { Work variable. }

begin
  NAP := 0;
  for i := 0 to NP-1 do
  begin
    PE[i].STATUS[0] := CURMASK.bit[i];
    if CURMASK.bit[i] then NAP := NAP + 1;
  end;
end; { PEEnable }


begin { MJH1NEW }

  MJH1Initialize;
  DumpCPState(NP,NAP,TRACE,ACTIVEMASK);

  Ifetch(IR);
  while (IR.opcode <> CPHALT) AND (NAP > 0) do
  begin

    if IR.opcode IN PEOPCODES then
    begin

      { Execute PE instruction on active processors. }
      for i := 0 to NP-1 do
        if ACTIVEMASK.bit[i] then
        begin
          PEexecute(IR,PE[i]);
          if TRACE > 0
            then dumprocessor(PE[i],TRACE);
        end;

      { Check for completion of all PEs before proceeding. }
      { Disable PEs that did not complete.                 }
      { NOTE: Calls to Error recovery code goes here       }
      checkcompletion(NP,ACTIVEMASK,COMPLETED);
      if NOT COMPLETED
        then updatemask(NP,ACTIVEMASK);
APPENDIX B


CODE FILES FOR THE MJH1

Following are existing code files that are currently operable
on the MJH1. The first line of each program is a title line that
gives the matrix dimensions, as well as a coded description of the
algorithm used. Every line after it holds an opcode and three
integer operands; the text that follows the operands is a comment.

A listing of some variables follows:

M = # of rows of matrix A
N = # of cols of A and rows of matrix B
P = # of cols of result matrix
T = # of terms of result matrix

The code files are annotated with simple comments. Careful
examination of these files should aid the potential programmer in
creating code files of his own.
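
When reading the listings that follow, it helps to know how the operand of ENABLE (opcode 100) selects processors. The sketch below mirrors the bit order of convertnum2mask in the emulator listing: the least significant bit of the mask code corresponds to the highest-numbered PE. It is an illustration only; the names MAXPE, NumToMask and ShowMask are assumptions introduced here and are not part of the emulator.

```pascal
program DecodeMask;
{ Sketch only: mirrors the bit order of convertnum2mask.          }
{ MAXPE, NumToMask and ShowMask are illustrative names (assumed). }
const
  MAXPE = 16;
type
  PEmask = array[0..MAXPE-1] of boolean;

procedure NumToMask(NP, NUM: integer; var MASK: PEmask);
var i, v: integer;
begin
  v := NUM;
  { The least significant bit lands in the highest-numbered PE. }
  for i := NP-1 downto 0 do
  begin
    MASK[i] := (v mod 2) = 1;
    v := v div 2;
  end;
end;

procedure ShowMask(NP, NUM: integer);
var i: integer;
    MASK: PEmask;
begin
  NumToMask(NP, NUM, MASK);
  write('mask code ', NUM:2, ' with NP = ', NP:1, ' enables PEs:');
  for i := 0 to NP-1 do
    if MASK[i] then write(' ', i:1);
  writeln;
end;

begin
  ShowMask(4, 12);   { PEs 0 and 1 }
  ShowMask(4, 10);   { PEs 0 and 2 }
  ShowMask(4, 1);    { PE 3 }
  ShowMask(2, 2);    { PE 0 }
end.
```

With NP = 4, mask code 12 (binary 1100) therefore enables PEs 0 and 1, and mask code 1 enables only PE 3. The same code enables different processors when NP changes, which is why the two-processor listings use mask codes 2, 1 and 3.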


ALG1: 2 x 3 BY 3 x 2; NP=4. [TEQK]
  0   0   0   0  For r=1 to m
  0   0   0   0  for c=1 to p
  0   0   0   0  ((r-1)p+c)th PE = rth row of [A]
  0   0   0   0  & the cth col of [B]
120   0  20   0  Clear 20 words in CM
110   0   2   3  Read matrix A.
111   0   2   3  Echo A.
110   6   3   2  Read matrix B and echo.
111   6   3   2  Echo B.
  0   0   0   0  Prepare for partitioning.
100  12   0   0  Enable PE's 0 & 1 to load A[1,j]
 15   0   0   0  PEs 0,1 at A[1,1].
 12   1   3   0  Allocate space in MA for 3 values
 20   1   3   0  Move 3 values into MA A[1,j]
100   3   0   0  Enable PEs 2 & 3
 15   3   0   0  Set address to A[2,j]
 12   1   3   0  Allocate space in MA for 3 values
 20   1   3   0  Read A[2,j] into MA
100  10   0   0  Enable PEs 0,2
 15   6   0   0  Both PEs can access B[1,1] (col1)
100   5   0   0  Enable PEs 1,3
 15   7   0   0  Both PEs can access B[1,2] (col2)
100  15   0   0  Enable all PEs
 12   2   3   0  Allocate space for 3 col values of B
 11   2   0   0  Set MB base reg to word zero
 20   2   1   0  Load 1st value from CM into MB
 10   2   1   0  Increment MB by 1
 16   2   0   0  Increment MAR to access next word from CM
 20   2   1   0  Load 2nd value from CM into MB
 10   2   1   0  Increment LM (MB)
 16   2   0   0  Increment MAR to access next word from CM
 20   2   1   0  Load 3rd value from CM into MB
103   3   0   0  Get a TRACE snapshot
103   0   0   0
  0   0   0   0  Prepare to multiply
 11   1   0   0  Set LM base register to 1st word
 11   2   0   0
 11   3   0   0
 12   3   1   0  Allocate room in MR for 4 values
103   3   0   0  Start a TRACE snapshot
 65   0   3   0  Multiply rows by cols
  0   0   0   0  STORE 4 values from each PEs R memory
100   8   0   0  Enable P0
 15  12   0   0  Set MAR to 1st result
100   4   0   0  Enable P1
 15  13   0   0  Set MAR to 2nd result
100   2   0   0  Enable P2
 15  14   0   0  Set MAR to 3rd result
100   1   0   0  Enable P3
 15  15   0   0  Set MAR to 4th result
100  15   0   0  Enable ALL PEs
 21   1   0   0  Store values
103   0   0   0  END TRACE
111  12   2   2  PRINT
255   0   0   0  CPHALT
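
The partitioning described in the ALG1 header comments can be sketched as follows: the ((r-1)p+c)th PE receives row r of [A] and column c of [B] and forms one inner product, as the INPRD instruction does. This is a sequential illustration under stated assumptions, not the emulator itself; the operand values are made-up sample data, and PE numbers are printed 0-based to match the mask comments in the listing.

```pascal
program Alg1Sketch;
{ Sketch only. The operand values in A and B are made-up sample data. }
const
  M = 2; N = 3; P = 2;            { 2 x 3 by 3 x 2, as in ALG1 }
  RBASE = 12;                     { result matrix starts at CM word 12 }
var
  A: array[1..M, 1..N] of integer;
  B: array[1..N, 1..P] of integer;
  r, c, k, pe, acc, addr: integer;
begin
  { Hypothetical operands, filled in row-major order. }
  for r := 1 to M do
    for k := 1 to N do
      A[r, k] := (r - 1) * N + k;
  for k := 1 to N do
    for c := 1 to P do
      B[k, c] := (k - 1) * P + c;

  for r := 1 to M do
    for c := 1 to P do
    begin
      pe := (r - 1) * P + c - 1;          { 0-based PE number }
      acc := 0;                           { INPRD starts with ACC = 0 }
      for k := 1 to N do
        acc := acc + A[r, k] * B[k, c];   { row r of A times col c of B }
      addr := RBASE + pe;                 { SETMA 12..15 in the listing }
      writeln('PE ', pe:1, ' -> C[', r:1, ',', c:1, '] = ', acc:3,
              '  stored at CM[', addr:2, ']');
    end;
end.
```

With the sample operands A = [1 2 3; 4 5 6] and B = [1 2; 3 4; 5 6], the four inner products are 22, 28, 49 and 64, one per PE, and they land in CM words 12 through 15 in row-major order.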



ALG2A: 2 x 3 BY 3 x 4; NP=4. [PEQK]
  0   0   0   0  Entire matrix [A] and one col of vals
  0   0   0   0  of [B] in order.
120   0  30   0  Clear 30 words in CM
110   0   2   3  Read matrix A.
111   0   2   3  Echo A.
110   6   3   4  Read matrix B and echo.
111   6   3   4  Echo B.
  0   0   0   0  Prepare for Partitioning.
100  15   0   0  Enable all PEs to load A[i,j]
 15   0   0   0  All PEs at A[1,1].
 12   1   6   0  Allocate space in MA for 6 values
 20   1   6   0  Move 6 values into MA A[i,j]
100   8   0   0  Enable PE 0
 15   6   0   0  Set address to B[1,1]
100   4   0   0  Enable PE 1
 15   7   0   0  Set address to B[1,2]
100   2   0   0  Enable PE 2
 15   8   0   0  Set address to B[1,3]
100   1   0   0  Enable PE 3
 15   9   0   0  Set address to B[1,4]
100  15   0   0  Enable all PEs
103   3   0   0  Start TRACE
 12   2   3   0  Allocate space for 3 col values of B
 20   2   1   0  Load 1st value from CM into MB
 10   2   1   0  Increment MB by 1
 16   4   0   0  Increment MAR to access next word from CM
 20   2   1   0  Load 2nd value from CM into MB
 10   2   1   0  Increment LM (MB)
 16   4   0   0  Increment MAR to access next word from CM
 20   2   1   0  Load 3rd value from CM into MB
 10   2   1   0  Increment MB by 1
  0   0   0   0  Prepare to multiply
 11   1   0   0  Set LM base register to 1st word
 11   2   0   0
 11   3   0   0
 12   3   2   0  Allocate room in MR for 2 values
 65   0   3   0  Multiply rows by cols A[1,j]
100   8   0   0  Enable P0
 15  18   0   0  Set MAR to 1st result
100   4   0   0  Enable P1
 15  19   0   0  Set MAR to 2nd result
100   2   0   0  Enable P2
 15  20   0   0  Set MAR to 3rd result
100   1   0   0  Enable P3
 15  21   0   0  Set MAR to 4th result
100  15   0   0  Enable ALL PEs
 21   1   0   0  Store values
  0   0   0   0  Prepare to continue multiplication
 11   2   0   0  Set LM base register to 3rd word
 11   1   3   0  Set LM base reg A to 1st word
 65   0   3   0  Multiply rows by cols A[2,i]
100   8   0   0  Enable P0
 15  22   0   0  Set MAR to 5th result
100   4   0   0  Enable P1
 15  23   0   0  Set MAR to 6th result
100   2   0   0  Enable P2
 15  24   0   0  Set MAR to 7th result
100   1   0   0  Enable P3
 15  25   0   0  Set MAR to final result
100  15   0   0  Enable ALL
 21   1   0   0  Send value
111  18   2   4  PRINT
103   0   0   0  STOP TRACE
255   0   0   0  CPHALT
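
The column loads in ALG2A rely on the row-major layout that MREAD gives a matrix in CM: B[1,c] sits at the base address plus (c-1), and adding the number of columns to MAR steps down one row. The sketch below prints the CM words each PE would read for its column of the 3 x 4 matrix [B] stored from word 6. It is an illustration only; the constant and variable names are assumptions introduced here.

```pascal
program ColumnStride;
{ Sketch only: row-major addressing as produced by MREAD. }
{ BASE, ROWS, COLS and mar are illustrative names.        }
const
  BASE = 6;     { [B] starts at CM word 6 in ALG2A }
  ROWS = 3;     { [B] is 3 x 4 }
  COLS = 4;
var
  pe, k, mar: integer;
begin
  for pe := 0 to COLS-1 do
  begin
    mar := BASE + pe;               { SETMA: address of B[1,pe+1] }
    write('PE ', pe:1, ' loads CM words:');
    for k := 1 to ROWS do
    begin
      write(' ', mar:2);            { LOADCM 2,1 reads CM[MAR] }
      mar := mar + COLS;            { ADMA 4 steps to the next row }
    end;
    writeln;
  end;
end.
```

PE 0 reads words 6, 10 and 14 (column 1 of [B]), PE 1 reads 7, 11 and 15, and so on. The same pattern appears in ALG1 with a stride of 2, because [B] there has two columns.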


ALG2B: 2 x 3 BY 3 x 4; NP=2. [PMULTK]
  0   0   0   0  Entire matrix [A] and p/k cols
  0   0   0   0  of vals of [B] in order
120  30   0   0  Clear 30 words in CM
110   0   2   3  Read matrix A.
111   0   2   3  Echo A.
110   6   3   4  Read matrix B and echo.
111   6   3   4  Echo B.
  0   0   0   0  Prepare for partitioning
100   3   0   0  Enable PE's 0 & 1 to load A[1,j]
 15   0   0   0  PEs 0,1 at A[1,1].
 12   1   6   0  Allocate space in MA for 6 values
 20   1   6   0  Move 6 values into MA A[i,j]
100   2   0   0  Enable PE 0
 15   6   0   0  Set address to B[1,1]
100   1   0   0  Enable PE 1
 15   8   0   0  Set address to B[1,3]
100   3   0   0  Enable both
103   3   0   0  Start TRACE
 12   2   6   0  Allocate space for 6 col values of B
 11   2   0   0  Set MB to 0
 20   2   1   0  Load 1st value from CM into MB
 10   2   1   0  Increment MB by 1
 16   4   0   0  Increment MAR to access next word from CM
 20   2   1   0  Load 2nd value from CM into MB
 10   2   1   0  Increment LM (MB)
 16   4   0   0  Increment MAR to access next word from CM
 20   2   1   0  Load 3rd value from CM into MB
 10   2   1   0  Increment MB by 1
100   2   0   0  Enable PE 0
 15   7   0   0  Set MAR to B[1,2]
100   1   0   0  Enable PE 1
 15   9   0   0  Set MAR to B[1,4]
100   3   0   0  Enable ALL
 20   2   1   0  Read 4th val
 16   4   0   0  Increment MAR
 10   2   1   0  Increment MB
 20   2   1   0  Load 5th val
 16   4   0   0  Increment MAR
 10   2   1   0  Increment MB
 20   2   1   0  Load next word
  0   0   0   0  Prepare to multiply
 11   1   0   0  Set LM base register to 1st word
 11   2   0   0
 11   3   0   0
 12   3   4   0  Allocate room in MR for 4 values
 65   0   3   0  Multiply rows by cols A[1,j]
 11   1   0   0  Set LM to row 1
 11   2   3   0  Set LM to col 2
 65   1   3   0
 11   1   3   0  Set to row 2
 11   2   0   0  Set to col 1
 65   2   3   0
 11   1   3   0  Set to row 2
 11   2   3   0  Set to col 2
 65   3   3   0
100   2   0   0  Enable P0
 15  18   0   0  Set MAR to 1st result
100   1   0   0  Enable P1
 15  20   0   0  Set MAR to 2nd result
100   3   0   0  Enable all
 21   2   0   0  Store values
100   2   0   0
 15  22   0   0  Set to C[2,1]
100   1   0   0
 15  24   0   0  Set to C[2,3]
100   3   0   0
 11   3   2   0  Set to 3rd value in MR
 21   2   0   0
111  18   2   4  PRINT
103   0   0   0  STOP TRACE
255   0   0   0  CPHALT
ALG3A: 4 x 1 BY 1 x 2; NP=4. [MEQK]
  0   0   0   0  One row of values of [A] and entire [B]
120  16   0   0  Clear 16 words in CM
110   0   4   1  Read matrix A.
111   0   4   1  Echo A.
110   4   1   2  Read matrix B and echo.
111   4   1   2  Echo B.


0  0 
100  15 
15  4 


100  8 
15  0 


100  15 
103  3 

12  1 
20  1 


0  0 
0  0 
0  0 


0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
1  0 


Start  alocation 

Enable  ALL  PE’s  to  load  B 

PEs  at  B [ 1 , 1 ] • 

Allocate  space  in  MB  for  2  values 
Move  2  values  into  MB 
Enable  PE  0 
Set  address  to  A[  1.1] 

Enable  PE  1 

Set  address  to  A[2,l] 

Enable  PE  3 
A  [  3 , 1  ] 

Enable  PE  4 
A[  4 , 1  ] 

Enable  ALL 
Start  TRACE 
Allocate  space  for  A 
Load  various  rows  into  MA 


0  0 

1  1  1 


65  0 

1  1  1 
1 1  2 
65  1 

100  8 
15  6 

100  4 

15  8 

100  2 
15  10 
100  1 
15  12 
100  15 
11  3 

21  1 
100  8 


0  0 
0  0 
0  0 
0  0 
2  0 
1  0 
0  0 
1  0 
1  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 
0  0 


Prepare  to  multiply 

Set  MA  to  0 

Set  MB  to  0 

Set  MR  to  0 

Allocate  room  in  MR  f( 

Mu  1 1  i  p 1 y 

Reset  B 

Enable  PO 

Set  MAR  to  1st  result 
Enable  PI 

Set  MAR  to  2nd  result 
Enable  PE  2 


values 


Enable  PE  1 
Se  t  MAR  to  4th 
Enable  ALL 


result 


Store 


result 


15  11 


0  0 
0  0 


 15  13   0   0
100  15   0   0  Enable ALL
 11   3   1   0
 21   1   0   0  Store other vals of
111   6   4   2  PRINT
103   0   0   0  STOP TRACE
255   0   0   0  CPHALT


ALG3B: 4 x 1 BY 1 x 2; NP=2. [MMULTK]
  0   0   0   0  (m/k) rows of [A] and entire [B]
120  16   0   0  Clear 16 words in CM
110   0   4   1  Read matrix A.
111   0   4   1  Echo A.
110   4   1   2  Read matrix B and echo.
111   4   1   2  Echo B.
  0   0   0   0  Start allocation
100   3   0   0  Enable ALL PE's to load B
 15   4   0   0  PEs at B[1,1].
 12   2   2   0  Allocate space in MB for 2 values
 20   2   2   0  Move 2 values into MB
100   2   0   0  Enable PE 0
 15   0   0   0  Set address to A[1,1]
100   1   0   0  Enable PE 1
 15   2   0   0  Set address to A[3,1]
100   3   0   0  Enable ALL PEs
103   3   0   0  Start TRACE
 12   1   2   0  Allocate space for A
 20   1   2   0  Load various rows into MA
  0   0   0   0  Prepare to multiply
 11   1   0   0  Set MA to 0
 11   2   0   0  Set MB to 0
 11   3   0   0  Set MR to 0
 12   3   4   0  Allocate room in MR for 2 values
 65   0   1   0  Multiply
 11   1   0   0  Reset B
 11   2   1   0
 65   1   1   0
 11   1   1   0
 11   2   0   0
 65   2   1   0
 11   1   1   0
 11   2   1   0
 65   3   1   0
100   2   0   0  Enable P0
 15   6   0   0  Set MAR to 1st result
100   1   0   0  Enable P1
 15  10   0   0  Set MAR to 2nd result
100   3   0   0  Enable ALL PEs
 11   3   0   0
 21   4   0   0  Store 1st results
111   6   4   2  PRINT
103   0   0   0  STOP TRACE
255   0   0   0  CPHALT
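
The row partitioning used in ALG3B can be sketched in the same way: each of the NP processors holds M/NP consecutive rows of [A] together with the entire [B], and writes its block of results back to CM in row-major order. The sketch below is a sequential illustration under stated assumptions; the operand values are made-up sample data and the variable names are introduced here for illustration.

```pascal
program Alg3bSketch;
{ Sketch only. The operand values in A and B are made-up sample data. }
const
  M = 4; P = 2; NP = 2;          { 4 x 1 by 1 x 2 on two PEs, as in ALG3B }
  RBASE = 6;                     { result matrix starts at CM word 6 }
var
  A: array[1..M] of integer;     { the single column of [A] }
  B: array[1..P] of integer;     { the single row of [B] }
  pe, i, j, row, addr, prod: integer;
begin
  for i := 1 to M do A[i] := i;            { hypothetical: 1 2 3 4 }
  for j := 1 to P do B[j] := 10 * j;       { hypothetical: 10 20 }

  for pe := 0 to NP-1 do
    for i := 1 to M div NP do
    begin
      row := pe * (M div NP) + i;          { PE holds M/NP consecutive rows }
      for j := 1 to P do
      begin
        prod := A[row] * B[j];             { INPRD with a count of 1 }
        addr := RBASE + (row - 1) * P + (j - 1);
        writeln('PE ', pe:1, ': C[', row:1, ',', j:1, '] = ',
                prod:3, ' -> CM[', addr:2, ']');
      end;
    end;
end.
```

PE 0 fills CM words 6 through 9 and PE 1 fills words 10 through 13, which matches the SETMA operands 6 and 10 and the STORCM count of 4 in the listing.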
ALG4A: 2x2 by 2x4; NP=4 [TMULTK]
  0   0   0   0  For r=1 to m
  0   0   0   0  { for c = 1 to q (k=qm)
  0   0   0   0  { ((r-1)q+c)th PE = rth row [A]
  0   0   0   0  and cth string of j cols of [B]
  0   0   0   0  in order } }
120   0  24   0  Allocate space for A,B,R
110   0   2   2  Read A
111   0   2   2  Echo A
110   4   2   4  Read B
111   4   2   4  Echo
  0   0   0   0  Prepare to Partition
 12   1   2   0  Allocate 2 spaces in A
100  12   0   0  Enable PEs 1,2
 15   0   0   0  Set MAR to origin
 20   1   2   0  Read 1st row of A into PEs 1,2
100   3   0   0  Enable PE 3,4
 15   2   0   0  Set MAR to row 2 of A
 20   1   2   0  Read 2nd row of A into PE 3,4
  0   0   0   0  Prepare to load cols
100  10   0   0  Enable PE 1,3
 15   4   0   0  Set MAR to 4 B[1,1]
100   5   0   0  Enable PE 2,4
 15   6   0   0  Set MAR to 6 B[1,3]
100  15   0   0  Enable ALL PEs
 11   2   0   0  Set MB to 0
 12   2   4   0  Allocate 4 values in B
 20   2   1   0  Read 1 value
 10   2   1   0  Increment MB by 1
 16   4   0   0  Increment MAR by 4
 20   2   1   0  Read 1 value
 10   2   1   0  Advance MB
100  10   0   0  Enable PE 1,3
 15   5   0   0  Set MAR to 2nd col
100   5   0   0
 15   7   0   0
100  15   0   0
 20   2   1   0  Read in next value
 10   2   1   0  Increment LM B
 16   4   0   0  Advance MAR to next col value
 20   2   1   0  Read in next value
  0   0   0   0  Perform multiplication
 11   1   0   0
 11   2   0   0  Set LM base register to 1st word
 11   3   0   0
  0   0   0   0  Prepare to multiply
 65   0   2   0  Multiply 1st value
 11   2   2   0  Increment MB to 2nd col
 11   1   0   0  Do A again
 65   1   2   0  Multiply again (2nd val)
  0   0   0   0  Prepare to PRINT result
100   8   0   0  Enable PE 1
 15  12   0   0  Position for [1,1]
100   4   0   0  Enable PE 2
 15  14   0   0  Position for [1,3]
100   2   0   0  Enable PE 3
 15  16   0   0  Position for [2,1]
100   1   0   0  Enable PE 4
 15  18   0   0  Position for [2,3]
100  15   0   0  Enable ALL PEs
 21   2   0   0  Send values
111  12   2   4  PRINT
255   0   0   0  CPHALT
ALG4B: 4x2 by 2x2; NP=4 [PSUBK]
  0   0   0   0  for q=1 to p
  0   0   0   0  { for c = 1 to r
  0   0   0   0  { ((q-1)r+c)th PE =
  0   0   0   0  cth string of j rows of [A]
  0   0   0   0  in order and qth col of [B]
120  24   0   0  Allocate space for A,B,C
110   0   4   2  Read A
111   0   4   2  Echo
110   8   2   2  Read B
111   8   2   2  Echo
  0   0   0   0  Partition Matrices
100  10   0   0  Enable PEs 1,3
 15   0   0   0  Set MAR to origin
 12   1   4   0  Allocate 4 values in A
 20   1   4   0  Read 1st 2 rows of A
100   5   0   0  Enable PEs 2,4
 12   1   4   0  Allocate 4 more values
 15   4   0   0  Set MAR up by 4
 20   1   4   0  Read in 2nd 2 rows of A
100  12   0   0  Enable PEs 1,2
 15   8   0   0  Set MAR to B[1,1]
100   3   0   0  Enable PEs 3,4
 15   9   0   0  Set MAR to B[1,2]
100  15   0   0  Enable ALL
 12   2   2   0  Allocate 2 values in B
 20   2   1   0  Read in 1st col value
 16   2   0   0  Increment MAR by 2
 10   2   1   0  Increment MB by 1
 20   2   1   0  Read in 2nd col value
  0   0   0   0  Prepare to multiply
 11   1   0   0
 11   2   0   0  Set ALL LM bases to 0
 11   3   0   0
 12   3   2   0  Allocate values in MR
 65   0   2   0  Multiply once for 1st row value
 11   1   2   0
 11   2   0   0  Reset MB to top of col
 65   1   2   0  Multiply again for 2nd col value
  0   0   0   0  Prepare to write out result
100   8   0   0  Enable PE 1
 15  12   0   0  Set MAR to 1st result in RESULT MATRIX
100   4   0   0  Enable PE 2
 15  16   0   0  Set MAR
100   2   0   0  Enable PE 3
 15  13   0   0  Set MAR
100   1   0   0  Enable PE 4
 15  17   0   0  Set MAR

Send  values 
Increment  MAR  by  2 
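
The ALG4B partitioning rule (the ((q-1)r+c)th PE receives the cth string of j rows of [A] and the qth column of [B]) can be sketched in Python as follows. This is an illustrative model and not the MJH1 code: the function names are hypothetical, indices inside the code are 0-based, and the row-major result base of 12 is an assumption read off the listing (`15 12 0 0`, "Set MAR to 1st result in RESULT MATRIX").

```python
def alg4b_partition(num_rows_a, num_cols_b, num_row_strings):
    """Map each 1-based PE number to (0-based rows of A, 0-based column of B)."""
    r = num_row_strings
    if num_rows_a % r != 0:
        raise ValueError("rows of A must divide evenly into row strings")
    j = num_rows_a // r                      # rows of A per string
    assignment = {}
    for q in range(1, num_cols_b + 1):       # column of B
        for c in range(1, r + 1):            # string of rows of A
            pe = (q - 1) * r + c
            rows = list(range((c - 1) * j, c * j))
            assignment[pe] = (rows, q - 1)
    return assignment


def alg4b_multiply(A, B, num_row_strings):
    """Compute A*B by letting each PE fill its own block of the result."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for rows, col in alg4b_partition(m, p, num_row_strings).values():
        for i in rows:
            C[i][col] = sum(A[i][k] * B[k][col] for k in range(n))
    return C


def result_address(base, num_cols_c, row, col):
    """Row-major address of C[row][col] (0-based) in central memory."""
    return base + row * num_cols_c + col
```

For the 4x2 by 2x2 case with two row strings, the first result of each PE lands at addresses 12, 16, 13 and 17 under these assumptions, which matches the MAR values set for PEs 1 to 4 at the end of the listing.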


ALG5A: 3x4 by 4x2; NP=4 [TGTKNEQK]



0  0  0  0  one col of [A] and one row of [B]
0  0  0  0  Load matrices
120  30  0  0  Allocate space in CM
110  0  3  4  Read in A
111  0  3  4  Echo
110  12  4  2  Read B
111  12  4  2  Echo
0  0  0  0  Partition Matrices
100  8  0  0  Enable PE 1
15  0  0  0  Set MAR to origin
100  4  0  0  Enable PE 2
15  1  0  0  Set MAR to A[1,2]
100  2  0  0  Enable PE 3
15  2  0  0  Set MAR to A[1,3]
100  1  0  0  Enable PE 4
15  3  0  0  Set MAR to A[1,4]
100  15  0  0  Enable ALL
12  1  3  0  Allocate 3 spaces in MA
20  1  1  0  Read in 1st col value of A
16  4  0  0  Increment MAR
10  1  1  0  Increment MA
20  1  1  0  Read 2nd col value of A
16  4  0  0  Increment MAR
10  1  1  0  Increment MA
20  1  1  0  Read 3rd col value of A
100  8  0  0  Enable PE 1
15  12  0  0  Set MAR to row 1 of B
100  4  0  0  Enable PE 2
15  14  0  0  Set MAR to row 2 of B
100  2  0  0  Enable PE 3
15  16  0  0  Set MAR to row 3 of B
100  1  0  0  Enable PE 4
15  18  0  0  Set MAR to row 4 of B
100  15  0  0  Enable ALL
12  2  2  0
20  2  2  0  Read 1 row of B
0  0  0  0  Prepare to Multiply
11  1  0  0
11  2  0  0  Set LM bases to 0
11  3  0  0
12  3  6  0  Allocate 6 values in MR
65  0  1  0  Multiply to find 1st INPRD
11  1  1  0  Increment MA
11  2  0  0  Dec MB (back at origin)
65  1  1  0  Multiply 2nd col value
11  1  2  0  Increase MA to 3rd value
11  2  0  0  Keep MB at origin
65  2  1  0  Multiply 3rd col value of INPRD
11  1  0  0  Dec MA to origin
11  2  1  0  Inc MB to 2nd col
65  3  1  0  Multiply 4th INPRD
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Multiply  5th  INPRD 


Multiply  6th  INPRD 

Prepare for INTERPROCESSOR COMMUNICATION

Reset LMs

Move MR to MA
Put one value on SPORT
Shift forward
Receive shifted data
Put MS into MB

Add previous result to MS; Store in MR[0]
Move result to MA
Shift forward
Receive shifted data
Put MS into MB
Add previous result to MS
Move result to MA
Shift forward
Receive shifted data
Put MS into MB
Add previous result to MS
Get 2nd FINAL RESULT

Reset LMs

Move MR to MA
Put one value on SPORT
Shift forward
Receive shifted data
Put MS into MB

Add previous result to MS; Store in MR[1]
Move result to MA
Shift forward
Receive shifted data
Put MS into MB
Add previous result to MS
Move result to MA
Shift forward
Receive shifted data
Put MS into MB
Add previous result to MS
Get 3rd FINAL RESULT

Reset LMs

Move MR to MA
Put one value on SPORT
Shift forward
Receive shifted data
Put MS into MB


40  0  0  0  Add previous result to MS; Store in MR[1]
30  3  1  0  Move result to MA
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS
30  3  1  0  Move result to MA
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS
0  0  0  0  Get 4th FINAL RESULT
11  1  3  0
11  2  3  0  Reset LMs
11  3  3  0
30  3  1  0  Move MR to MA
33  3  1  0  Put one value on SPORT
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS; Store in MR[1]
30  3  1  0  Move result to MA
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS
30  3  1  0  Move result to MA
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS
0  0  0  0  Get 5th FINAL RESULT
11  1  4  0
11  2  4  0  Reset LMs
11  3  4  0
30  3  1  0  Move MR to MA
33  3  1  0  Put one value on SPORT
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS; Store in MR[1]
30  3  1  0  Move result to MA
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS
30  3  1  0  Move result to MA
22  1  1  0  Shift forward
25  1  1  0  Receive shifted data
31  4  1  0  Put MS into MB
40  0  0  0  Add previous result to MS
0  0  0  0  Get 6th FINAL RESULT
11  1  5  0
11  2  5  0  Reset LMs
11  3  5  0
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22  1 
25  1 


25  1 
31  4 
40  0 


40  0 

100  8
11  3

15  20
100  4

11  3

15  21
100  2

11  3

15  22

100  1

11  3

15  23
100  15

21  1 
100  12 
10  3 

16  4 
21  1 

111  20 
255  0 


1  0 


1  0 
1  0 

0  0 


1  0 
1  0 
1  0 
0  0 
1  0 
1  0 
1  0 
1  0 
0  0 
0  0 
0  0 
0  0 
0  0 

3  0 
0  0 
0  0 
1  0 
0  0 
0  0 

4  0 
0  0 
0  0 
0  0 
0  0 


0  0 
3  2 

0  0 


Move MR to MA
Put one value on SPORT
Shift forward
Receive shifted data
Put MS into MB

Add previous result to MS; Store in MR[1]
Move result to MA
Shift forward
Receive shifted data
Put MS into MB
Add previous result to MS
Move result to MA
Shift forward
Receive shifted data
Put MS into MB
Add previous result to MS
Enable PE 1

Set MAR to 1st result
Enable PE 2

Set MAR to 2nd result
Enable PE 3

Set MAR to 3rd result
Enable PE 4

Set MAR to 4th result
Enable ALL
Send 1 value
Enable PEs 1,2


PRINT 

CPHALT 
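
In ALG5A each PE holds one column of [A] and one row of [B], forms the six partial products (INPRDs) of that column-row pair, and the partial products are then summed across PEs with three forward shifts per result. The Python sketch below models this. It is one plausible reading of the listing rather than a definitive account of the shift hardware: the function names are hypothetical, PEs are numbered from 0, and it assumes that each shift passes the travelling value one PE further around a ring while every PE keeps a running sum.

```python
def alg5a_partial_products(A, B, pe):
    """Products held by one PE: column `pe` of A times row `pe` of B,
    stored in row-major order of the result matrix."""
    return [A[i][pe] * B[pe][j]
            for i in range(len(A)) for j in range(len(B[0]))]


def ring_sum(values):
    """Sum one value per PE using len(values) - 1 forward shifts.

    After the last shift every PE holds the full sum."""
    n = len(values)
    acc = list(values)        # running sum kept in each PE
    moving = list(values)     # value currently travelling round the ring
    for _ in range(n - 1):
        moving = [moving[(pe - 1) % n] for pe in range(n)]   # shift forward
        acc = [a + s for a, s in zip(acc, moving)]
    return acc


def alg5a_multiply(A, B):
    """Multiply A (m x n) by B (n x p) with one PE per inner index."""
    m, n, p = len(A), len(B), len(B[0])
    partials = [alg5a_partial_products(A, B, pe) for pe in range(n)]
    C = [[0] * p for _ in range(m)]
    for idx in range(m * p):
        totals = ring_sum([partials[pe][idx] for pe in range(n)])
        C[idx // p][idx % p] = totals[0]     # every PE holds the same total
    return C
```

For the 3x4 by 4x2 case this uses four PEs, six partial products per PE, and three shifts per final result, matching the counts in the listing.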
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ALG5B : 

3x8 by 8x2; NP=4 [TGTKNMULTK]

0  0  0  0  (n/k) cols of [A] and
0  0  0  0  (n/k) rows of [B] in order
0  0  0  0  Load matrices
120  48  0  0  Allocate space for 48 spaces
110  0  3  8  Read A
111  0  3  8  Echo
110  24  8  2  Read B
111  24  8  2  Echo
0  0  0  0  Partition matrices
100  8  0  0  Enable PE 1
15  0  0  0  Set MAR to origin
100  4  0  0  Enable PE 2
15  2  0  0  Set MAR to A[1,3]
100  2  0  0  Enable PE 3
15  4  0  0  Set MAR to A[1,5]
100  1  0  0  Enable PE 4
15  6  0  0  Set MAR to A[1,7]
100  15  0  0  Enable ALL
12  1  6  0  Allocate enough space for 6 values
20  1  2  0  Read 1st 2 values
16  8  0  0  Increment MAR
10  1  2  0  Increment MA
20  1  2  0  Read 2nd col value
16  8  0  0  Increment MAR
10  1  2  0  Increment MA
20  1  2  0  Read 3rd col value
100  8  0  0  Enable PE 4
15  24  0  0  Beginning of B[1,1]
100  4  0  0  Enable PE 3
15  28  0  0
100  2  0  0  Enable 2
15  32  0  0
100  1  0  0  Enable 1
15  36  0  0
100  15  0  0  Enable ALL
12  2  4  0  Allocate 4 values in B
20  2  1  0  Read in 1 value
10  2  1  0
16  2  0  0
20  2  1  0
100  8  0  0
15  25  0  0
100  4  0  0
15  29  0  0
100  2  0  0
15  33  0  0
100  1  0  0
15  37  0  0
100  15  0  0
10
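
The ALG5B listing gives each PE n/k consecutive columns of [A] and the matching n/k rows of [B], and sets the MAR to the start of each block. A minimal Python sketch of that block assignment and of the row-major address arithmetic behind the MAR values follows. The function names are hypothetical, indices are 0-based, and the bases 0 for [A] and 24 for [B] are taken from the `110 0 3 8` and `110 24 8 2` rows of the listing.

```python
def row_major_address(base, num_cols, row, col):
    """Address of element (row, col), 0-based, of a row-major matrix at base."""
    return base + row * num_cols + col


def alg5b_blocks(n, num_pes):
    """Split the n inner indices into num_pes consecutive blocks."""
    if n % num_pes != 0:
        raise ValueError("n must be a multiple of the number of PEs")
    width = n // num_pes
    return [list(range(pe * width, (pe + 1) * width)) for pe in range(num_pes)]


def alg5b_multiply(A, B, num_pes=4):
    """Multiply A (m x n) by B (n x p); each PE contributes the partial
    sums over its own block of inner indices."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for block in alg5b_blocks(n, num_pes):
        for i in range(m):
            for j in range(p):
                C[i][j] += sum(A[i][k] * B[k][j] for k in block)
    return C
```

For the 3x8 by 8x2 case with four PEs, the first element of each block of [A] falls at addresses 0, 2, 4 and 6 and the first element of each block of [B] at 24, 28, 32 and 36, the same values the listing loads into the MAR.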
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ALG5DI :  3X2  by  2X3;  NP-4  [ TGTKNSUBKBMGTP ] 

SO  FAR,  THIS  WON’T  WORK  FOR  THIS  CASE.  THERE  SEEMS  TO  BE 
SOMETHING  WRONG  WITH  THE  ALGORITHM.  IT  IS  NOT  ALLOWING  FOR  ROW  2 
OF  MATRIX  [A]  TO  BE  ALLOCATED  TO  ANY  PE. 


ALG5DII: 3x2 by 2x5; NP=4 [TGTKNSUBKMLTP]
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Reset  LMs 


Move  MR  to  MA 

Put  one  value  on  SPORT 
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22  1  1  0 
25  1  1  0 

31  4  1  0 

40  0  0  0 

0  0  0  0 
11  1  3  0 

11  2  3  0 

11  3  3  0 

30  3  1  0 

33  3  1  0 

22  1  1  0 
25  1  1  0 

31  4  1  0 

40  0  0  0 

0  0  0  0 
11  1  4  0 

11  2  4  0 

11  3  4  0 

30  3  1  0 

33  3  1  0 

22  1  1  0 
25  1  1  0 

31  4  1  0 

40  0  0  0 

0  0  0  0 
11  1  2  0 
11  2  2  0 
11  3  2  0 

30  3  1  0 

33  3  1  0 

22  1  1  0 
25  1  1  0 

31  4  1  0 

40  0  0  0 

0  0  0  0 
11  3  0  0 

100  4  0  0 

15  16  0  0 

100  1  0  0 
15  17  0  0 

100  2  0  0 
15  18  0  0 

100  1  0  0 

15  19  0  0 

100  15  0  0 

21  1  0  0 
10  3  1  0 

16  5  0  0 

21  1  0  0 
10  3  1  0 

16  5  0  0 

21  1  0  0 

100  4  0  0

10  3  1  0 

15  20  0  0 


Shift forward
Receive shifted data
Put MS into MB

Add previous result to MS; Store in MR[1]
Get 4th FINAL RESULT

Reset LMs

Move MR to MA
Put one value on SPORT
Shift forward
Receive shifted data
Put MS into MB

Add previous result to MS; Store in MR[1]
Get 4th FINAL RESULT

Reset LMs

Move MR to MA
Put one value on SPORT
Shift forward
Receive shifted data
Put MS into MB

Add previous result to MS; Store in MR[1]
Get 5th FINAL RESULT

Reset LMs

Move MR to MA
Put one value on SPORT
Shift forward
Receive shifted data
Put MS into MB

Add previous result to MS; Store in MR[1]
Prepare  to  output 
Reset  Result  memory 
Enable  PE  2 

Set  result  to  1st  location 
Enable  PE  4 

Set  result  to  2nd  3  col  values 
Enable  3 


Read  1  value 
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0 

0 

0 

For s=1 to i
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0 

0 

0 

{ sth string of m PEs get
m rows in order of [A] and,

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

for c=0 to j

{  (ck/m+s)th  col  of  vals  of  [B] 

0 

0 

0 

0 

ADD’L  VALS: 

0

0

0 

0 

0 

for s=(i+1) to k/m (if s>p)

0 

0 

0 

{ sth string of m processors get m rows

0

0 

0 

0 

of  [A]  in  order  and, 

0

0 

0 

0 

for c=0 to (j-1)

0

0 

0 

0 

{(ck/m+s)th  col  of  vals  of  [B] 

0

0 

0 

0 

Load matrices

120

24 

0 
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Allocate  24  values 

110

0 

2 

3 

Read  A 
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0 

2 

3 

Echo 

110

6 

3 

3 

Read  B 

111
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6 

3 

3 

Echo 

0

0 

0 

0 

Prepare  to  partition  A  and  B 
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15 

0 

0 

Enable  ALL 

15

0 

0 

0 
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ft  1 2 

1 

6 

0 

Allocate  6  values 
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1 

6 

0 

Read  i n  A 
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12 

0 

0 
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6 

0 

0 
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3 

0 

0 

Enable  PEs  3,4 
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8 

0 

0 
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15 

0 

0 

Enable  ALL 
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2 

0 

0 

12

2 

6 

0 

Reserve  6  spaces  in  B 

20

2 

1 

0 

Read  1st  col  value 

16 

3 

0 

0 

Increment  to  next  value 

10

2 

1 

0 

Increment  MB 
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2 

1 

0 

Read  2nd  val 

16

3 

0 

0 

Increment  MAR 
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2 

1 

0 

Increment  MB 
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2 

1 

0 

Read  another  value 
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2 

1 

0 

Increment  MB 
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3 

0 

0 

Enable  PEs  3,4 
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7 

0 

0 
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2 

1 

0 

Read  in  2nd  col  of  B 
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3 

0 

0 

Increment  MAR 
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2 

1 

0 

Increment  MB 

20

2 

1 

0 

Read  val 

16

3 

0 

0 

Increment  MAR 

10

2 

1 

0 

Increment  MB 
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2 

1 

0 

Read  final  value 
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0 

0 

0 

Prepare  to  multiply 
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0 
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0 
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0 
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0 
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0 

Reset  B 
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0 
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3 

0 

0 
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1 

0 

0 
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0 

65 
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3 

0 

1st  of  add’l  values 
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1 

3 

0 

1  1 
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3 

0 

65 

3 

3 

0 

2nd  of  add'l  values 

0 

0 

0 

0 

Prepare  to  write  out  values 

100 

8 

0 

0 

Enable  PE  1 

15 

15 

0 

0 

C[1,1]
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2 

0 

0 

Enable  PE  3 

15 

17 

0 

0 

C[1,3]
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10 

0 

0 

Enable  PEs  1,3 

11

3 

0 

0 

21 

1 

0 

0 
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10 

3 

1 

0 
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3 

0 

0 

21 

1 

0 

0 

Send  2nd  col  val 
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2 

0 

0 

Enable  PE  3  only 

15 

16 

0 

0 

10 

3 

1 

0 

Advance  MR 

21 

1 

0 

0 

16 

3 

0 

0 

Increment to correct location

10 

3 

1 

0 

Advance  MR 

21 

1 

0 

0 

Increment  to  correct  location 
Advance  MR 
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by 
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0 

0 

0 
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0 

0 

0 

0 
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0 

0 

0 
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p  cols  of  [ B ]  in  order,  and 
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0 
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0 
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0 
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0 
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0 

Prepare  to  load  matrices 
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20 

0 

0 

Allocate  space  for  20  values 

110  0  3  2  Read [A]
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0 

3 

2 

Echo 

110  6  2  2  Read [B]
111  6  2  2  Echo
0  0  0  0  Partition matrices
100  12  0  0  Enable 1,2
15  0  0  0  A[1,1]
100  3  0  0  Enable 3,4
15  2  0  0  A[2,1]
100  15  0  0  Enable ALL
11  1  0  0  Set MA to zero
12  1  4  0  Allocate space for 4 values in MA
20  1  2  0  Read 1st row
100  12  0  0  Activate only 1,2
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4 

0 

0 

Increment  MAR 
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1 

2 

0 

Increment  LM 
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0 

Read  2nd  row 

100 
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0 

0 

Enable  All 

15 

6 

0 

0 

B[1,1]

12 

2 
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0 

Reserve space for 4 words
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2 

4 

0 

Read  B 

0 

0 
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0 

Prepare  to  multiply 
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3 
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0 

Allocate  values  in  MR 

11 

1 
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0 
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0 

Set  LM  to  0 
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0 
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0 

Multiply  to  find  INPRD 
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0 

Reset  A 
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0 
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0 

Multiply  for  2nd  INPRD 

11 

1 

2 

0 

11 

2 

0 

0 

65 

2 

2 

0 

11 

1 

2 

0 

11 

2 
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0 
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Prepare  to  PRINT 

Prepare  to  PRINT 
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100  800 
15  10  0  0 

100  2  0  0
15  12  0  0 

100  4  0  0

11  3  2  0 

15  14  0  0 

100  15  0  0 

21  2  0  0 
0  0  0  0 
111  10  3  2 

255  0  0  0


C[1,1]

C[2,1]


Send  2  values  over 
PRINT 

Print  Result 
CPHALT
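
Each row of the listings in this appendix has the form "opcode operand1 operand2 operand3 comment". The Python sketch below parses such a row and attaches a descriptive name to the opcode. The names in `MNEMONICS` are hypothetical labels inferred from the comments that accompany each opcode in the listings, not official MJH1 mnemonics, and `parse_row` and `describe` are illustrative helpers.

```python
# Hypothetical labels, inferred from the listing comments only.
MNEMONICS = {
    0: "NOTE",            # all-zero rows carry a comment only
    10: "INC_LM_BASE",    # "Increment MA/MB", "Advance MR"
    11: "SET_LM_BASE",    # "Set LM bases to 0", "Reset LMs"
    12: "ALLOC_LM",       # "Allocate 4 values in B"
    15: "SET_MAR",        # "Set MAR to origin"
    16: "INC_MAR",        # "Increment MAR by 4"
    20: "READ_TO_LM",     # "Read 1 value"
    21: "SEND_TO_CM",     # "Send values"
    22: "SHIFT",          # "Shift forward"
    25: "RECEIVE",        # "Receive shifted data"
    30: "MOVE",           # "Move MR to MA"
    31: "PUT_MS",         # "Put MS into MB"
    33: "PUT_SPORT",      # "Put one value on SPORT"
    40: "ADD",            # "Add previous result to MS"
    65: "MULTIPLY",       # "Multiply 1st value"
    100: "ENABLE",        # "Enable PE 1", "Enable ALL"
    110: "READ_MATRIX",   # "Read A"
    111: "PRINT",         # "Echo", "PRINT"
    120: "ALLOC_CM",      # "Allocate space in CM"
    255: "CPHALT",
}


def parse_row(line):
    """Split a listing row into (opcode, op1, op2, op3, comment)."""
    parts = line.split(None, 4)
    if len(parts) < 4:
        raise ValueError("a row needs an opcode and three operands")
    opcode, op1, op2, op3 = (int(field) for field in parts[:4])
    comment = " ".join(parts[4].split()) if len(parts) > 4 else ""
    return opcode, op1, op2, op3, comment


def describe(line):
    """Render a row with its inferred name, e.g. for checking a listing."""
    opcode, op1, op2, op3, comment = parse_row(line)
    name = MNEMONICS.get(opcode, "OP%d" % opcode)
    text = "%s %d,%d,%d" % (name, op1, op2, op3)
    return text + ("  ; " + comment if comment else "")
```

A row that has lost a field in scanning, such as one with only two numbers, raises `ValueError` instead of being silently padded, so damaged rows are easy to spot.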


